Abstract

Traditional quantum state tomography requires a number of measurements that grows exponentially 
with the number of qubits n. But using ideas from computational learning theory, we show that "for most 
practical purposes" one can learn a state using a number of measurements that grows only linearly with 
n. Besides possible implications for experimental physics, our learning theorem has two applications to 
quantum computing: first, a new simulation of quantum one-way communication protocols, and second, 
the use of trusted classical advice to verify untrusted quantum advice. 

1 Introduction 

Suppose we have a physical process that produces a quantum state. By applying the process repeatedly, 
we can prepare as many copies of the state as we want, and can then measure each copy in a basis of our 
choice. The goal is to learn an approximate description of the state by combining the various measurement 
outcomes. 

This problem is called quantum state tomography, and it is already an important task in experimental 
physics. To give some examples, tomography has been used to obtain a detailed picture of a chemical 
reaction (namely, the dissociation of $\mathrm{I}_2$ molecules); to confirm the preparation of three-photon [35] and
eight-ion [20] entangled states; to test controlled-NOT gates [25]; and to characterize optical devices [15].

Physicists would like to scale up tomography to larger systems, in order to study the many-particle 
entangled states that arise (for example) in chemistry, condensed-matter physics, and quantum information. 
But there is a fundamental obstacle in doing so. This is that, to reconstruct an n-qubit state, one needs 
to measure a number of observables that grows exponentially in $n$: in particular like $4^n$, the number of
parameters in a $2^n \times 2^n$ density matrix. This exponentiality is certainly a practical problem: Häffner et
al. [20] report that, to reconstruct an entangled state of eight calcium ions, they needed to perform 656,100
experiments! But to us it is a theoretical problem as well. For it suggests that learning an arbitrary state 
of (say) a thousand particles would take longer than the age of the universe, even for a being with unlimited 
computational power. This, in turn, raises the question of what one even means when talking about such a 
state. For whatever else a quantum state might be, at the least it ought to be a hypothesis that encapsulates 
previous observations of a physical system, and thereby lets us predict future observations! 

Our purpose here is to propose a new resolution of this conundrum. We will show that, to predict the 
outcomes of "most" measurements on a quantum state, it suffices to do what we call pretty-good tomography — 
requiring a number of measurements that grows only linearly with the number of qubits n. 

As a bonus, we will be able to use our learning theorem to prove two new results in quantum computing
and information. The first result is a new relationship between randomized and quantum one-way
communication complexities: namely that $R^1(f) = O(M Q^1(f))$ for any partial or total Boolean function
$f$, where $M$ is the length of the recipient's input. The second result says that trusted classical advice
can be used to verify untrusted quantum advice on most inputs; or in terms of complexity classes, that
HeurBQP/qpoly $\subseteq$ HeurQMA/poly. Both of these results follow from our learning theorem in intuitively
appealing ways; on the other hand, we would have no idea how to prove these results without the theorem.
To us, this provides strong evidence that our model for quantum state learning is, if not 'the right one,' then
certainly a right one.

We wish to stress that the main contribution of this paper is conceptual rather than technical. All of 
the 'heavy mathematical lifting' needed to prove the learning theorem has already been done: once one has 
the appropriate setup, the theorem follows readily by combining previous results due to Bartlett and Long 
[9] and Ambainis et al. [7]. Indeed, what is surprising to us is precisely that such a basic theorem was not
discovered earlier. 

In the remainder of this introduction, we first give a formal statement of our learning theorem, then 
answer objections to it, situate it in the context of earlier work, and discuss its implications. 

1.1 Statement of Result 

Let $\rho$ be an $n$-qubit mixed state: that is, a $2^n \times 2^n$ Hermitian positive semidefinite matrix with $\mathrm{Tr}(\rho) = 1$.
By a measurement of $\rho$, we will mean a "two-outcome POVM": that is, a $2^n \times 2^n$ Hermitian matrix $E$
with eigenvalues in $[0,1]$. Such a measurement $E$ accepts $\rho$ with probability $\mathrm{Tr}(E\rho)$, and rejects $\rho$ with
probability $1 - \mathrm{Tr}(E\rho)$.
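
As a concrete illustration of these definitions, here is a minimal sketch in Python with NumPy. The function names are ours, chosen for illustration, and the one-qubit state and measurement are a hypothetical example rather than anything used later in the paper.

```python
import numpy as np

def is_two_outcome_povm(E, tol=1e-9):
    """Check that E is Hermitian with all eigenvalues in [0, 1]."""
    E = np.asarray(E, dtype=complex)
    if not np.allclose(E, E.conj().T, atol=tol):
        return False
    eigenvalues = np.linalg.eigvalsh(E)
    return bool(eigenvalues.min() >= -tol and eigenvalues.max() <= 1 + tol)

def acceptance_probability(E, rho):
    """Return Tr(E rho), the probability that measurement E accepts state rho."""
    return float(np.real(np.trace(np.asarray(E) @ np.asarray(rho))))

# One-qubit example: rho = |+><+|, E = projector onto |0>.
plus = np.array([1.0, 1.0]) / np.sqrt(2)
rho = np.outer(plus, plus.conj())
E = np.array([[1.0, 0.0], [0.0, 0.0]])
p_accept = acceptance_probability(E, rho)  # accepts with probability 1/2
```

The rejection probability is simply `1 - p_accept`, so a single number per measurement captures everything the learner is asked to predict.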

Our goal will be to learn $\rho$. Our notion of "learning" here is purely operational: we want a procedure
that, given a measurement $E$, estimates the acceptance probability $\mathrm{Tr}(E\rho)$. Of course, estimating $\mathrm{Tr}(E\rho)$
for every $E$ is the same as estimating $\rho$ itself, and we know this requires exponentially many measurements.
So if we want to learn $\rho$ using fewer measurements, then we will have to settle for some weaker success
criterion. The criterion we adopt is that we should be able to estimate $\mathrm{Tr}(E\rho)$ for most measurements
$E$. In other words, we assume there is some (possibly unknown) probability distribution $\mathcal{D}$ from which the
measurements are drawn.¹ We are given a "training set" of measurements $E_1, \ldots, E_m$ drawn independently
from $\mathcal{D}$, as well as the approximate values of $\mathrm{Tr}(E_i\rho)$ for $i \in \{1, \ldots, m\}$. Our goal is to estimate $\mathrm{Tr}(E\rho)$
for most $E$'s drawn from $\mathcal{D}$, with high probability over the choice of training set.

We will show that this can be done using a number of training measurements $m$ that grows only linearly
with the number of qubits $n$, and inverse-polynomially with the relevant error parameters. Furthermore, the
learning procedure that achieves this bound is the simplest one imaginable: it suffices to find any "hypothesis
state" $\sigma$ such that $\mathrm{Tr}(E_i\sigma) \approx \mathrm{Tr}(E_i\rho)$ for all $i$. Then with high probability that hypothesis will "generalize,"
in the sense that $\mathrm{Tr}(E\sigma) \approx \mathrm{Tr}(E\rho)$ for most $E$'s drawn from $\mathcal{D}$. More precisely:

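Theorem 1.1 Let $\rho$ be an $n$-qubit mixed state, let $\mathcal{D}$ be a distribution over two-outcome measurements of
$\rho$, and let $\mathcal{E} = (E_1, \ldots, E_m)$ be a "training set" consisting of $m$ measurements drawn independently from $\mathcal{D}$.
Also, fix error parameters $\varepsilon, \eta, \gamma > 0$ with $\gamma\varepsilon \ge 7\eta$. Call $\mathcal{E}$ a "good" training set if any hypothesis $\sigma$ that
satisfies

$$|\mathrm{Tr}(E_i\sigma) - \mathrm{Tr}(E_i\rho)| \le \eta$$

for all $E_i \in \mathcal{E}$, also satisfies

$$\Pr_{E \in \mathcal{D}} \left[ |\mathrm{Tr}(E\sigma) - \mathrm{Tr}(E\rho)| > \gamma \right] \le \varepsilon.$$

Then there exists a constant $K > 0$ such that $\mathcal{E}$ is a good training set with probability at least $1 - \delta$, provided
that

$$m \ge \frac{K}{\gamma^2\varepsilon^2} \left( \frac{n}{\gamma^2\varepsilon^2} \log^2 \frac{1}{\gamma\varepsilon} + \log \frac{1}{\delta} \right).$$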

1.2 Objections and Variations 

Before proceeding further, it will be helpful to answer various objections that might be raised against 
Theorem 1.1. Along the way, we will also state two variations of the theorem.

Objection 1 By changing the goal to "pretty good tomography," Theorem 1.1 dodges much of the quantum
state tomography problem as ordinarily understood.

1 $\mathcal{D}$ can also be a continuous probability measure; this will not affect any of our results.

Response. Yes, that is exactly what it does! The motivating idea is that one does not need to know the
expectation values for all observables, only for most of the observables that will actually be measured. As an
example, if we can only apply 1- and 2-qubit measurements, then the outcomes of 3-qubit measurements are
irrelevant by assumption. As a less trivial example, suppose the measurement distribution $\mathcal{D}$ is uniformly
random (i.e., is the Haar measure). Then even if our quantum system is "really" in some pure state $|\psi\rangle$, for
reasonably large $n$ it will be billions of years before we happen upon a measurement that distinguishes $|\psi\rangle$
from the maximally mixed state. Hence the maximally mixed state is perfectly adequate as an explanatory
hypothesis, despite being far from $|\psi\rangle$ in the usual metrics such as trace distance.
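
The Haar-measure example can be checked numerically on a small scale. The sketch below rests on an assumption of ours: a "uniformly random" measurement is modeled as a random projector of rank $N/2$, where $N = 2^n$. By unitary invariance, applying a random projector to a fixed pure state has the same statistics as applying a fixed projector to a Haar-random pure state, which is what the code samples. All names are illustrative.

```python
import numpy as np

def haar_random_state(dim, rng):
    """Sample a Haar-random pure state as a normalized complex Gaussian vector."""
    v = rng.normal(size=dim) + 1j * rng.normal(size=dim)
    return v / np.linalg.norm(v)

def max_deviation_from_mixed(n_qubits, trials, seed=0):
    """Largest |Tr(E psi) - Tr(E I/N)| seen, with E a fixed rank-N/2 projector."""
    rng = np.random.default_rng(seed)
    dim = 2 ** n_qubits
    worst = 0.0
    for _ in range(trials):
        psi = haar_random_state(dim, rng)
        accept_pure = float(np.sum(np.abs(psi[: dim // 2]) ** 2))  # Tr(E |psi><psi|)
        worst = max(worst, abs(accept_pure - 0.5))                 # Tr(E I/N) = 1/2
    return worst

deviation = max_deviation_from_mixed(n_qubits=10, trials=200)
```

Already at ten qubits the acceptance probabilities cluster tightly around $1/2$, the value predicted by the maximally mixed state, and the spread shrinks further as $n$ grows.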

Of course, even after one relaxes the goal in this way, it might still seem surprising that for any state $\rho$,
and any distribution $\mathcal{D}$, a linear amount of tomographic data is sufficient to simulate most measurements
drawn from $\mathcal{D}$. This is the content of Theorem 1.1.

Objection 2 But to apply Theorem 1.1, one needs the measurements to be drawn independently from some
probability distribution $\mathcal{D}$. Is this not a strange assumption? Shouldn't one also allow adaptive
measurements?

Response. If all of our training data involved measurements in the $\{|0\rangle, |1\rangle\}$ basis, then regardless of how
much data we had, clearly we couldn't hope to simulate a measurement in the $\{|+\rangle, |-\rangle\}$ basis! Therefore,
as usual in learning theory, to get anywhere we need to make some assumption to the effect that the future 
will resemble the past. Such an assumption does not strike us as unreasonable in the context of quantum 
state estimation. For example, suppose that (as is often the case) the measurement process was itself 
stochastic, so that the experimenter did not know which observable was going to be measured until after 
it was measured. Or suppose the state was a "quantum program," which only had to succeed on typical 
inputs drawn from some probability distribution.²

However, with regard to the power of adaptive measurements, it is possible to ask slightly more sophisticated
questions. For example, suppose we perform a binary measurement $E_1$ (drawn from some distribution
$\mathcal{D}$) on one copy of an $n$-qubit state $\rho$. Then, based on the outcome $z_1 \in \{0, 1\}$ of that measurement, suppose
we perform another binary measurement $E_2$ (drawn from a new distribution $\mathcal{D}_{z_1}$) on a second copy of $\rho$;
and so on for $r$ copies of $\rho$. Finally, suppose we compute some Boolean function $f(z_1, \ldots, z_r)$ of the $r$
measurement outcomes.

Now, how many times will we need to repeat this adaptive procedure before, given $E_1, \ldots, E_r$ drawn as
above, we can estimate (with high probability) the conditional probability that $f(z_1, \ldots, z_r) = 1$? If we
simply apply Theorem 1.1 to the tensor product of all $r$ registers, then it is easy to see that $O(nr)$ samples
suffice. Furthermore, using the ideas of Appendix 9, one can show that this is optimal: in other words, no
improvement to (say) $O(n + r)$ samples is possible.

Indeed, even if we wanted to estimate the probabilities of all r of the measurement outcomes simultane- 
ously, it follows from the union bound that we could do this with high probability, after a number of samples 
linear in n and polynomial in r. 

We hope this illustrates how our learning theorem can be applied to more general settings than that for 
which it is explicitly stated. Naturally, there is a great deal of scope here for further research. 

Objection 3 Theorem 1.1 is purely information-theoretic; as such, it says nothing about the computational
complexity of finding a hypothesis state $\sigma$.

Response. This is correct. Using semidefinite and convex programming techniques, one can implement any
of our learning algorithms to run in time polynomial in the Hilbert space dimension, $N = 2^n$. This might
be fine if $n$ is at most 12 or so; note that "measurement complexity," and not computational complexity, has
almost always been the limiting factor in real experiments. But of course such a running time is prohibitive
for larger $n$.

2 At this point we should remind the reader that the distribution $\mathcal{D}$ over measurements only has to exist; it does not have
to be known. All of our learning algorithms will be "distribution-free," in the sense that a single algorithm will work for any
choice of $\mathcal{D}$.

Let us stress that exactly the same problem arises even in classical learning theory. For it follows from a 
celebrated result of Goldreich, Goldwasser, and Micali [19] that, if there exists a polynomial-time algorithm
to find a Boolean circuit of size n consistent with observed data (whenever such a circuit exists), then there
are no cryptographic one-way functions. Using the same techniques, one can show that, if there exists a
polynomial-time quantum algorithm to prepare a state of $n^k$ qubits consistent with observed data (whenever
such a state exists), then there are no (classical) one-way functions secure against quantum attack. The
only difference is that, while finding a classical hypothesis consistent with data is an NP search problem,³
finding a quantum hypothesis is a QMA search problem. 

A fundamental question left open by this paper is whether there are nontrivial special cases of the 
quantum learning problem that can be solved, not only with a linear number of measurements, but also with 
a polynomial amount of quantum computation. 

Objection 4 The dependence on the error parameters $\gamma$ and $\varepsilon$ in Theorem 1.1 looks terrible.

Response. Indeed, no one would pretend that performing $\sim n/(\gamma^4\varepsilon^4)$ measurements is practical for reasonable
$\gamma$ and $\varepsilon$. Fortunately, we can improve the dependence on $\gamma$ and $\varepsilon$ quite substantially, at the cost of increasing
the dependence on $n$ from linear to $n \log^2 n$.
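
To get a feel for how quickly the Theorem 1.1 bound grows, the sketch below evaluates its right-hand side, $\frac{K}{\gamma^2\varepsilon^2}\left(\frac{n}{\gamma^2\varepsilon^2}\log^2\frac{1}{\gamma\varepsilon} + \log\frac{1}{\delta}\right)$. The theorem only asserts that some constant $K$ exists, so `K=1.0` is a placeholder of ours, as is the use of natural logarithms; the output shows scaling, not an actual measurement count.

```python
import math

def theorem_1_1_bound(n, gamma, eps, delta, K=1.0):
    """Right-hand side of the Theorem 1.1 bound; K=1.0 is a placeholder constant."""
    ge2 = (gamma * eps) ** 2
    log_term = math.log(1.0 / (gamma * eps)) ** 2
    return (K / ge2) * ((n / ge2) * log_term + math.log(1.0 / delta))

# Linear growth in n, but steep growth as gamma and eps shrink.
for n in (10, 20, 40):
    print(n, theorem_1_1_bound(n, gamma=0.1, eps=0.1, delta=0.05))
```

Doubling $n$ roughly doubles the bound, whereas halving both $\gamma$ and $\varepsilon$ multiplies the leading term by more than $2^8$.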

Theorem 1.2 The bound in Theorem 1.1 can be replaced by



for all $\varepsilon, \eta, \gamma > 0$ with $\gamma > \eta$.

In Appendix 9 we will show that the dependence on $\gamma$ and $\varepsilon$ in Theorem 1.2 is close to optimal.

Objection 5 To estimate the measurement probabilities $\mathrm{Tr}(E_i\rho)$, one needs the ability to prepare multiple
copies of $\rho$.

Response. This is less an objection to Theorem 1.1 than to quantum mechanics itself! If one has only
one copy of $\rho$, then Holevo's Theorem [21] immediately implies that not even "pretty good tomography" is
possible.

Objection 6 Even with unlimited copies of $\rho$, one could never be certain that the condition of Theorem 1.1
was satisfied (i.e., that $|\mathrm{Tr}(E_i\sigma) - \mathrm{Tr}(E_i\rho)| \le \eta$ for every $i$).

Response. This is correct, but there is no need for certainty. For suppose we apply each measurement $E_i$
to $\Theta(\log(m)/\eta^2)$ copies of $\rho$. Then by a large deviation bound, with overwhelming probability we will obtain
real numbers $p_1, \ldots, p_m$ such that $|p_i - \mathrm{Tr}(E_i\rho)| \le \eta/2$ for every $i$. So if we want to find a hypothesis state
$\sigma$ such that $|\mathrm{Tr}(E_i\sigma) - \mathrm{Tr}(E_i\rho)| \le \eta$ for every $i$, then it suffices to find a $\sigma$ such that $|p_i - \mathrm{Tr}(E_i\sigma)| \le \eta/2$
for every $i$. Certainly such a $\sigma$ exists, for take $\sigma = \rho$.
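
The estimation step in this response can be made concrete. The sketch below uses Hoeffding's inequality together with a union bound over the $m$ measurements, which is one standard choice of large deviation bound and is our assumption here rather than something the text fixes. The function names and the parameter `failure_prob` are likewise ours.

```python
import math
import random

def copies_per_measurement(m, eta, failure_prob):
    """Copies so that all m estimates are within eta/2 of the truth."""
    # Hoeffding: Pr[|p_hat - p| >= eta/2] <= 2 exp(-N eta^2 / 2).
    # Union bound over m measurements: require this to be <= failure_prob / m.
    return math.ceil((2.0 / eta ** 2) * math.log(2.0 * m / failure_prob))

def estimate_acceptance(true_prob, copies, rng):
    """Empirical acceptance frequency from independent two-outcome measurements."""
    accepted = sum(1 for _ in range(copies) if rng.random() < true_prob)
    return accepted / copies

rng = random.Random(0)
copies = copies_per_measurement(m=100, eta=0.1, failure_prob=0.01)
p_hat = estimate_acceptance(true_prob=0.3, copies=copies, rng=rng)
```

The count grows only logarithmically in $m$ and as $1/\eta^2$ in the accuracy, matching the $\Theta(\log(m)/\eta^2)$ scaling above.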

Objection 7 But what if one can apply each measurement only once, rather than multiple times? In that 
case, the above estimation strategy no longer works. 

3 Interestingly, in the "representation-independent" setting (where the output hypothesis can be an arbitrary Boolean circuit),
this problem is not known to be NP-complete.

Response. In Appendix 8 we will prove a learning theorem that applies directly to this "measure-once"
scenario. The disadvantage is that the upper bound on the number of measurements will increase from
$\sim 1/(\gamma^4\varepsilon^4)$ to $\sim 1/(\gamma^8\varepsilon^4)$.

Theorem 1.3 Let $\rho$ be an $n$-qubit state, let $\mathcal{D}$ be a distribution over two-outcome measurements, and let
$\mathcal{E} = (E_1, \ldots, E_m)$ consist of $m$ measurements drawn independently from $\mathcal{D}$. Suppose we are given bits
$B = (b_1, \ldots, b_m)$, where each $b_i$ is 1 with independent probability $\mathrm{Tr}(E_i\rho)$ and 0 with probability $1 - \mathrm{Tr}(E_i\rho)$.
Suppose also that we choose a hypothesis state $\sigma$ to minimize the quadratic functional $\sum_{i=1}^m (\mathrm{Tr}(E_i\sigma) - b_i)^2$.

Then there exists a positive constant $K$ such that

$$\Pr_{E \in \mathcal{D}} \left[ |\mathrm{Tr}(E\sigma) - \mathrm{Tr}(E\rho)| > \gamma \right] \le \varepsilon$$

with probability at least $1 - \delta$ over $\mathcal{E}$ and $B$, provided that

$$m \ge \frac{K}{\gamma^4\varepsilon^2} \left( \frac{n}{\gamma^4\varepsilon^2} \log^2 \frac{1}{\gamma\varepsilon} + \log \frac{1}{\delta} \right).$$
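
For a single qubit, the minimization in Theorem 1.3 can be sketched in a few lines. Everything below is an illustrative assumption of ours: the state is written as a Bloch vector $r$ with $|r| \le 1$, each measurement is a projector $E_i = (I + a_i \cdot \vec{\sigma})/2$ along a random unit direction $a_i$ (so that $\mathrm{Tr}(E_i\sigma) = (1 + a_i \cdot r)/2$), and the quadratic functional is minimized by projected gradient descent, a generic convex-optimization method rather than the paper's own procedure.

```python
import numpy as np

def random_unit_vectors(m, rng):
    """Directions a_i for projective one-qubit measurements E_i = (I + a_i . sigma) / 2."""
    v = rng.normal(size=(m, 3))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def fit_bloch_vector(directions, bits, iterations=500):
    """Minimize sum_i (Tr(E_i sigma) - b_i)^2 over Bloch vectors r with |r| <= 1."""
    r = np.zeros(3)
    # The Hessian of the objective is A^T A / 2; its top eigenvalue sets a safe step size.
    step = 2.0 / np.linalg.eigvalsh(directions.T @ directions).max()
    for _ in range(iterations):
        residual = (1.0 + directions @ r) / 2.0 - bits
        r = r - step * (directions.T @ residual)  # gradient is A^T residual
        norm = np.linalg.norm(r)
        if norm > 1.0:
            r = r / norm  # project back onto the Bloch ball of valid states
    return r

rng = np.random.default_rng(1)
true_r = np.array([0.0, 0.0, 0.6])            # hypothetical unknown state rho
A = random_unit_vectors(4000, rng)            # training measurements drawn from D
accept_probs = (1.0 + A @ true_r) / 2.0       # Tr(E_i rho)
b = (rng.random(4000) < accept_probs).astype(float)  # one outcome bit per measurement
r_hat = fit_bloch_vector(A, b)
```

Each measurement is applied exactly once, yet the fitted Bloch vector lands close to the true one, which is the "measure-once" behavior the theorem describes.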



Objection 8 What if, instead of applying the "ideal" measurement $E$, the experimenter can only apply a
noisy version $E'$?

Response. If the noise that corrupts $E$ to $E'$ is governed by a known probability distribution such as a
Gaussian, then $E'$ is still just a POVM, so Theorem 1.1 applies directly. If the noise is adversarial, then
we can also apply Theorem 1.1 directly, provided we have an upper bound on $|\mathrm{Tr}(E'\rho) - \mathrm{Tr}(E\rho)|$ (which
simply gets absorbed into $\eta$).



Objection 9 What if the measurements have $k > 2$ possible outcomes?

Response. Here is a simple reduction to the two-outcome case. Before applying the $k$-outcome POVM
$E = \{E^{(1)}, \ldots, E^{(k)}\}$, first choose an integer $j \in \{1, \ldots, k\}$ uniformly at random, and then pretend that the
POVM being applied is $\{E^{(j)}, I - E^{(j)}\}$ (i.e., ignore the other $k - 1$ outcomes). By the union bound, if our
goal is to ensure that

$$\Pr_{E \in \mathcal{D}} \left[ \sum_{j=1}^{k} \left| \mathrm{Tr}(E^{(j)}\sigma) - \mathrm{Tr}(E^{(j)}\rho) \right| > \gamma \right] \le \varepsilon$$

with probability at least $1 - \delta$, then in our upper bounds it suffices to replace every occurrence of $\gamma$ by $\gamma/k$,
and every occurrence of $\varepsilon$ by $\varepsilon/k$. We believe that one could do better than this by analyzing the $k$-outcome
case directly; we leave this as an open problem.⁴
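
The reduction is short enough to write out. In this sketch the function name and the three-outcome one-qubit POVM are hypothetical examples of ours; the only requirement is that the elements be positive semidefinite and sum to the identity.

```python
import numpy as np

def reduce_to_two_outcome(povm_elements, rng):
    """Pick j uniformly at random and return the two-outcome POVM {E^(j), I - E^(j)}."""
    j = int(rng.integers(len(povm_elements)))
    accept = np.asarray(povm_elements[j])
    reject = np.eye(accept.shape[0]) - accept
    return j, accept, reject

# Hypothetical 3-outcome POVM on one qubit: PSD elements summing to the identity.
povm = [np.diag([0.5, 0.0]), np.diag([0.5, 0.25]), np.diag([0.0, 0.75])]
rng = np.random.default_rng(0)
j, accept, reject = reduce_to_two_outcome(povm, rng)
```

Because each $E^{(j)}$ already has eigenvalues in $[0,1]$, the pair returned is a valid two-outcome measurement in the sense of Section 1.1.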

1.3 Related Work 

This paper builds on two research areas — computational learning theory and quantum information theory — 
in order to say something about a third area: quantum state estimation. Since many readers are probably 
unfamiliar with at least one of these areas, let us discuss them in turn. 



4 Notice that any sample complexity bound must have at least a linear dependence on $k$. Here is a proof sketch: given a
subset $S \subseteq \{1, \ldots, k\}$ with $|S| = k/2$, let $|S\rangle$ be a uniform superposition over the elements of $S$. Now consider simulating a
measurement of $|S\rangle$ in the computational basis, $\{|1\rangle, \ldots, |k\rangle\}$. It is clear that $\Omega(k)$ sample measurements are needed to do
this even approximately.

Computational Learning Theory

Computational learning theory can be understood as a modern response to David Hume's Problem of
Induction: "if an ornithologist sees 500 ravens and all of them are black, why does that provide any grounds
at all for expecting the 501st raven to be black? After all, the hypothesis that the 501st raven will be white
seems equally compatible with evidence." The answer, from a learning theory perspective, is that in practice
one always restricts attention to some class $\mathcal{C}$ of hypotheses that is vastly smaller than the class of logically
conceivable hypotheses. So the real question is not "is induction possible?," but rather "what properties
does the class $\mathcal{C}$ have to satisfy for induction to be possible?"

In a seminal 1989 paper, Blumer et al. [11] showed that if $\mathcal{C}$ is finite, then any hypothesis that agrees
with $O(\log |\mathcal{C}|)$ randomly-chosen data points will probably agree with most future data points as well.
Indeed, even if $\mathcal{C}$ is infinite, one can upper-bound the number of data points needed for learning in terms
of a combinatorial parameter of $\mathcal{C}$ called the VC (Vapnik-Chervonenkis) dimension. Unfortunately, these
results apply only to Boolean hypothesis classes. So to prove our learning theorem, we will need a more
powerful result due to Bartlett and Long [9], which upper-bounds the number of data points needed to learn
real-valued hypothesis classes.

Quantum Information Theory 

Besides results from classical learning theory, we will also need a result of Ambainis et al. [7] in quantum
information theory. Ambainis et al. showed that, if we want to encode $k$ bits into an $n$-qubit quantum
state, in such a way that any one bit can later be retrieved with error probability at most $p$, then we need
$n \ge (1 - H(p))k$, where $H$ is the binary entropy function.
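
As a quick numerical reading of this bound, the sketch below evaluates $(1 - H(p))k$ using the base-2 binary entropy. The function names are ours, and the values of $k$ and $p$ are arbitrary illustrations.

```python
import math

def binary_entropy(p):
    """H(p) = -p log2 p - (1 - p) log2 (1 - p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def min_qubits(k, p):
    """Lower bound (1 - H(p)) * k on the qubits needed to encode k bits with error at most p."""
    return (1.0 - binary_entropy(p)) * k

bound = min_qubits(k=1000, p=0.1)
```

Even with a 10% error allowed on each bit, more than half as many qubits as bits are needed, so qubits cannot compress classical data by more than a constant factor in this setting.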

Perhaps the central idea of this paper is to turn Ambainis et al.'s result on its head, and see it not as 
lower-bounding the number of qubits needed for coding and communication tasks, but instead as upper- 
bounding the "effective dimension" of a quantum state to be learned. (In theoretical computer science, this 
is hardly the first time that a negative result has been turned into a positive one. A similar "lemons-into- 
lemonade" conceptual shift was made by Linial, Mansour, and Nisan [25], when they used a limitation of 
constant-depth circuits to give an efficient algorithm for learning those circuits.) 

Quantum State Estimation 

Physicists have been interested in quantum state estimation since at least the 1950's (see [27] for a 
good overview). For practical reasons, they have been particularly concerned with minimizing the number 
of measurements. However, most literature on the subject restricts attention to low-dimensional Hilbert 
spaces (say, 2 or 3 qubits), taking for granted that the number of measurements will increase exponentially 
with the number of qubits. 

There is a substantial body of work on how to estimate a quantum state given incomplete measurement
results; see Bužek et al. [14] for a good introduction to the subject, or Bužek [13] for estimation algorithms
that are similar in spirit to ours. But there are at least two differences between the previous work and 
ours. First, while some of the previous work offers numerical evidence that few measurements seem to 
suffice in practice, so far as we know none of it considers asymptotic complexity. Second, the previous work 
almost always assumes that an experimenter starts with a prior probability distribution over quantum states 
(often the uniform distribution), and then either updates the distribution using Bayes' rule, or else applies 
a Maximum-Likelihood principle. By contrast, our learning approach requires no assumptions about a 
distribution over states; it instead requires only a (possibly-unknown) distribution over measurements. The 
advantage of the latter approach, in our view, is that an experimenter has much more control over which 
measurements to apply than over the nature of the state to be learned. 

1.4 Implications 

The main implication of our learning theorem is conceptual: it shows that quantum states, considered as a 
hypothesis class, are "reasonable" in the sense of computational learning theory. Were this not the case, 
it would presumably strengthen the view of quantum computing skeptics [18, 24] that quantum states are
"inherently extravagant" objects, which will need to be discarded as our knowledge of physics expands.⁵
Instead we have shown that, while the "effective dimension" of an n-qubit Hilbert space appears to be 
exponential in n, in the sense that is relevant for approximate learning and prediction this appearance is 
illusory. 

Beyond establishing this conceptual point, we believe our learning theorem could be of practical use in 
quantum state estimation, since it provides an explicit upper bound on the number of measurements needed 
to "learn" a quantum state with respect to any probability measure over observables. Even if our actual 
result is not directly applicable, we hope the mere fact that this sort of learning is possible will serve as a 
spur to further research. As an analogy, classical computational learning theory has had a large influence 
on neural networks, computer vision, and other fields,⁶ but this influence might have had less to do with the
results themselves than with their philosophical moral. 

We turn now to a more immediate application of our learning theorem: solving open problems in quantum 
computing and information. 

The first problem concerns quantum one-way communication complexity. In this subject we consider a 
sender, Alice, and a receiver, Bob, who hold inputs x and y respectively. We then ask the following question: 
assuming the best communication protocol and the worst (x, y) pair, how many bits must Alice send to Bob, 
for Bob to be able to evaluate some joint function $f(x, y)$ with high probability? Note that there is no
back-communication from Bob to Alice. 

Let $R^1(f)$ and $Q^1(f)$ be the number of bits that Alice needs to send, if her message to Bob is randomized
or quantum respectively.⁷ Then, improving an earlier result of Aaronson [1], in Section 3 we are able to
show the following:

Theorem 1.4 For any Boolean function $f$ (partial or total), $R^1(f) = O(M Q^1(f))$, where $M$ is the length
of Bob's input.

Intuitively, this means that if Bob's input is small, then quantum communication provides at most a 
small advantage over classical communication. 

The proof of Theorem 1.4 will rely on our learning theorem in an intuitively appealing way. Basically,
Alice will send some randomly-chosen "training inputs," which Bob will then use to learn a "pretty good 
description" of the quantum state that Alice would have sent him in the quantum protocol. 

The second problem concerns approximate verification of quantum software. Suppose you want to
evaluate some Boolean function $f : \{0, 1\}^n \to \{0, 1\}$, on typical inputs $x$ drawn from a probability distribution
$\mathcal{D}$. So you go to the quantum software store and purchase $|\psi_f\rangle$, a $q$-qubit piece of quantum software. The
software vendor tells you that, to evaluate $f(x)$ on any given input $x \in \{0, 1\}^n$, you simply need to apply a
fixed measurement $E$ to the state $|\psi_f\rangle |x\rangle$. However, you do not trust $|\psi_f\rangle$ to work as expected. Thus, the
following question arises: is there a fixed, polynomial-size set of "benchmark inputs" $x_1, \ldots, x_T$, such that
for any quantum program $|\psi_f\rangle$, if $|\psi_f\rangle$ works on the benchmark inputs then it will also work on most inputs
drawn from $\mathcal{D}$?

Using our learning theorem, we will show in Appendix 7 that the answer is yes. Indeed, we will actually
go further than that, and give an efficient procedure to test $|\psi_f\rangle$ against the benchmark inputs. The central
difficulty here is that the measurements intended to test $|\psi_f\rangle$ might also destroy it. We will resolve this
difficulty by means of a "Witness Protection Lemma," which might have applications elsewhere.

In terms of complexity classes, we can state our verification theorem as follows: 

Theorem 1.5 HeurBQP/qpoly $\subseteq$ HeurQMA/poly.

Here BQP/qpoly is the class of problems solvable in quantum polynomial time, with help from a polynomial-size
"quantum advice state" $|\psi_n\rangle$ that depends only on the input length $n$; while QMA (Quantum Merlin-Arthur)
is the class of problems for which a 'yes' answer admits a polynomial-size quantum proof. Then
HeurBQP/qpoly and HeurQMA/poly are the heuristic versions of BQP/qpoly and QMA/poly respectively:
that is, the versions where we only want to succeed on most inputs rather than all of them.

5 Or at least, it would suggest that the "operationally meaningful" quantum states comprise only a tiny portion of Hilbert
space.

6 According to Google Scholar, Valiant's original paper on the subject [31] has been cited 1829 times, with a large fraction
of the citations coming from practitioners.

7 Here the superscript '1' denotes one-way communication.



2 The Measurement Complexity of Quantum Learning 

We now prove Theorems 1.1 and 1.2. To do so, we first review results from computational learning theory,
which upper-bound the number of data points needed to learn a hypothesis in terms of the "dimension" of 
the underlying hypothesis class. We then use a result of Ambainis et al. [7] to upper-bound the dimension 
of the class of n-qubit mixed states. 

2.1 Learning Probabilistic Concepts 

The prototype of the sort of learning theory result we need is the "Occam's Razor Theorem" of Blumer et
al. [11], which is stated in terms of a parameter called VC dimension. However, Blumer et al.'s result does
not suffice for our purpose, since it deals with Boolean concepts, which map each element of an underlying
sample space to $\{0, 1\}$. By contrast, we are interested in probabilistic concepts (called p-concepts by Kearns
and Schapire [22]), which map each measurement $E$ to a real number $\mathrm{Tr}(E\rho) \in [0, 1]$.

Generalizing from Boolean concepts to p-concepts is not as straightforward as one might hope. Fortunately,
various authors have already done most of the work for us, with results due to
Anthony and Bartlett [8] and to Bartlett and Long [9] being particularly relevant. To state their results, we
need some definitions. Let $\mathcal{S}$ be a finite or infinite set called the sample space. Then a p-concept over $\mathcal{S}$ is
a function $F : \mathcal{S} \to [0, 1]$, and a p-concept class over $\mathcal{S}$ is a set of p-concepts over $\mathcal{S}$. Kearns and Schapire
[22] proposed a measure of the complexity of p-concept classes, called the fat-shattering dimension.

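Definition 2.1 Let $\mathcal{S}$ be a sample space, let $\mathcal{C}$ be a p-concept class over $\mathcal{S}$, and let $\gamma > 0$ be a real number.
We say a set $\{s_1, \ldots, s_k\} \subseteq \mathcal{S}$ is $\gamma$-fat-shattered by $\mathcal{C}$ if there exist real numbers $\alpha_1, \ldots, \alpha_k$ such that for all
$B \subseteq \{1, \ldots, k\}$, there exists a p-concept $F \in \mathcal{C}$ such that for all $i \in \{1, \ldots, k\}$,

(i) if $i \notin B$ then $F(s_i) \le \alpha_i - \gamma$, and

(ii) if $i \in B$ then $F(s_i) \ge \alpha_i + \gamma$.

Then the $\gamma$-fat-shattering dimension of $\mathcal{C}$, or $\mathrm{fat}_{\mathcal{C}}(\gamma)$, is the maximum $k$ such that some $\{s_1, \ldots, s_k\} \subseteq \mathcal{S}$
is $\gamma$-fat-shattered by $\mathcal{C}$. (If there is no finite such maximum, then $\mathrm{fat}_{\mathcal{C}}(\gamma) = \infty$.)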
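
For a finite class, the definition can be checked by brute force. The sketch below tests whether a given set of points is $\gamma$-fat-shattered for one fixed choice of witnesses $\alpha_1, \ldots, \alpha_k$; it does not search over witnesses. The table of concept values and the function name are hypothetical examples of ours.

```python
from itertools import product

def is_fat_shattered(values, alphas, gamma):
    """Check Definition 2.1 for a finite class with fixed witnesses alphas.

    values[c][i] is F_c(s_i), the value of concept c at sample point s_i.
    """
    k = len(alphas)
    for membership in product([False, True], repeat=k):  # indicator vector of B
        realized = any(
            all(
                (row[i] >= alphas[i] + gamma) if membership[i]
                else (row[i] <= alphas[i] - gamma)
                for i in range(k)
            )
            for row in values
        )
        if not realized:
            return False  # some subset B has no concept with the required margins
    return True

# Hypothetical class of four p-concepts evaluated on two sample points.
concepts = [[0.1, 0.1], [0.1, 0.9], [0.9, 0.1], [0.9, 0.9]]
shattered = is_fat_shattered(concepts, alphas=[0.5, 0.5], gamma=0.3)
not_shattered = is_fat_shattered(concepts, alphas=[0.5, 0.5], gamma=0.45)
```

The same two points are shattered at margin 0.3 but not at margin 0.45, which illustrates why the fat-shattering dimension is a non-increasing function of $\gamma$.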

We can now state the result of Anthony and Bartlett. 

Theorem 2.2 (Anthony and Bartlett [8]) Let $\mathcal{S}$ be a sample space, let $\mathcal{C}$ be a p-concept class over $\mathcal{S}$,
and let $\mathcal{D}$ be a probability measure over $\mathcal{S}$. Fix an element $F \in \mathcal{C}$, as well as error parameters $\varepsilon, \eta, \gamma > 0$
with $\gamma > \eta$. Suppose we draw $m$ samples $X = (x_1, \ldots, x_m)$ independently according to $\mathcal{D}$, and then choose
any hypothesis $H \in \mathcal{C}$ such that $|H(x) - F(x)| \le \eta$ for all $x \in X$. Then there exists a positive constant $K$
such that

$$\Pr_{x \in \mathcal{D}} \left[ |H(x) - F(x)| > \gamma \right] \le \varepsilon$$

with probability at least $1 - \delta$ over $X$, provided that



Notice that in Theorem 2.2, the dependence on the fat-shattering dimension is superlinear. We would
like to reduce the dependence to linear, at least when $\eta$ is sufficiently small. We can do so using the following
result of Bartlett and Long.⁸

8 The result we state is a special case of Bartlett and Long's Theorem 20, where the function $F$ to be learned is itself a
member of the hypothesis class $\mathcal{C}$.

Theorem 2.3 (Bartlett and Long [9]) Let $\mathcal{S}$ be a sample space, let $\mathcal{C}$ be a p-concept class over $\mathcal{S}$, and
let $\mathcal{D}$ be a probability measure over $\mathcal{S}$. Fix a p-concept $F : \mathcal{S} \to [0,1]$ (not necessarily in $\mathcal{C}$), as well as
an error parameter $\alpha > 0$. Suppose we draw $m$ samples $X = (x_1, \ldots, x_m)$ independently according to $\mathcal{D}$,
and then choose any hypothesis $H \in \mathcal{C}$ such that $\sum_{i=1}^m |H(x_i) - F(x_i)|$ is minimized. Then there exists a
positive constant $K$ such that

$$\mathrm{EX}_{x \in \mathcal{D}} \left[ |H(x) - F(x)| \right] \le \alpha + \inf_{C \in \mathcal{C}} \mathrm{EX}_{x \in \mathcal{D}} \left[ |C(x) - F(x)| \right]$$

with probability at least $1 - \delta$ over $X$, provided that

$$m \ge \frac{K}{\alpha^2} \left( \mathrm{fat}_{\mathcal{C}}\!\left(\frac{\alpha}{5}\right) \log^2 \frac{1}{\alpha} + \log \frac{1}{\delta} \right).$$

Theorem 2.3 has the following corollary.

Corollary 2.4 In the statement of Theorem 2.2, suppose $\gamma\varepsilon \ge 7\eta$. Then the bound on $m$ can be replaced by




Like all proofs in this paper, the proof of Corollary 2.4 is deferred to an appendix.



2.2 Learning Quantum States 

We now turn to the problem of learning a quantum state. Let $\mathcal{S}$ be the set of two-outcome measurements
on $n$ qubits. Also, given an $n$-qubit mixed state $\rho$, let $F_\rho : \mathcal{S} \to [0, 1]$ be the p-concept defined by
$F_\rho(E) = \mathrm{Tr}(E\rho)$, and let $\mathcal{C}_n = \{F_\rho\}_\rho$ be the class of all such $F_\rho$'s. Then to apply Theorems 2.2 and 2.3,
all we need to do is upper-bound $\mathrm{fat}_{\mathcal{C}_n}(\gamma)$ in terms of $n$ and $\gamma$. We will do so using a result of Ambainis et
al. [7], which upper-bounds the number of classical bits that can be "encoded" into $n$ qubits.

Theorem 2.5 (Ambainis et al. [7]) Let $k$ and $n$ be positive integers with $k > n$. For all $k$-bit strings $y = y_1 \cdots y_k$, let $\rho_y$ be an $n$-qubit mixed state that "encodes" $y$. Suppose there exist two-outcome measurements $E_1, \ldots, E_k$ such that for all $y \in \{0,1\}^k$ and $i \in \{1, \ldots, k\}$,

(i) if $y_i = 0$ then $\operatorname{Tr}(E_i \rho_y) \le p$, and

(ii) if $y_i = 1$ then $\operatorname{Tr}(E_i \rho_y) \ge 1 - p$.

Then $n \ge (1 - H(p))\, k$, where $H$ is the binary entropy function.
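
To get a feel for the bound in Theorem 2.5, the following Python sketch evaluates the binary entropy function and the resulting limit on $k$. The function names are illustrative and do not come from the paper; the only assumption beyond the theorem statement is the usual convention $H(0) = H(1) = 0$.

```python
import math


def binary_entropy(p: float) -> float:
    """Binary entropy H(p) in bits, with the convention H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def qubit_lower_bound(k: int, p: float) -> float:
    """Right-hand side (1 - H(p)) * k of the inequality n >= (1 - H(p)) k."""
    return (1.0 - binary_entropy(p)) * k


def max_encodable_bits(n: int, p: float) -> float:
    """Largest k compatible with n >= (1 - H(p)) k; requires p != 1/2."""
    gap = 1.0 - binary_entropy(p)
    if gap <= 0.0:
        raise ValueError("p = 1/2 gives no constraint on k")
    return n / gap


if __name__ == "__main__":
    # How many bits can 10 qubits encode at various error levels p?
    for p in (0.0, 0.1, 0.25, 0.4):
        print(p, round(max_encodable_bits(10, p), 2))
```

For instance, at $p = 0.25$ the factor $1 - H(p)$ is roughly $0.19$, so $n$ qubits can encode at most about $5.3n$ bits at that error level.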

Theorem 2.5 has the following easy generalization.

Theorem 2.6 Let $k$, $n$, and $\{\rho_y\}$ be as in Theorem 2.5. Suppose there exist measurements $E_1, \ldots, E_k$, as well as real numbers $\alpha_1, \ldots, \alpha_k$, such that for all $y \in \{0,1\}^k$ and $i \in \{1, \ldots, k\}$,

(i) if $y_i = 0$ then $\operatorname{Tr}(E_i \rho_y) \le \alpha_i - \gamma$, and

(ii) if $y_i = 1$ then $\operatorname{Tr}(E_i \rho_y) \ge \alpha_i + \gamma$.

Then $n/\gamma^2 = \Omega(k)$.

If we interpret $k$ as the size of a fat-shattered subset of $\mathcal{S}$, then Theorem 2.6 immediately yields the following upper bound on fat-shattering dimension.

Corollary 2.7 For all $\gamma > 0$ and $n$, we have $\operatorname{fat}_{\mathcal{C}_n}(\gamma) = O(n/\gamma^2)$.
Combining Corollary 2.4 with Corollary 2.7, we find that if $\gamma\varepsilon \ge 7\eta$, then it suffices to use

$$m = O\left( \frac{1}{\gamma^2 \varepsilon^2} \left( \frac{n}{\gamma^2 \varepsilon^2} \log^2 \frac{1}{\gamma\varepsilon} + \log \frac{1}{\delta} \right) \right)$$

measurements. Likewise, combining Theorem 2.2 with Corollary 2.7, we find that if $\gamma > \eta$, then it suffices to use the number of measurements stated in Theorem 1.2. This completes the proofs of Theorems 1.1 and 1.2 respectively.



3 Application to Quantum Communication 

In this section we use our quantum learning theorem to prove a new result about one-way communication complexity. Here we consider two players, Alice and Bob, who hold inputs $x$ and $y$ respectively. For concreteness, let $x$ be an $N$-bit string, and let $y$ be an $M$-bit string. Also, let $f : \mathcal{Z} \to \{0,1\}$ be a Boolean function, where $\mathcal{Z}$ is some subset of $\{0,1\}^N \times \{0,1\}^M$. We call $f$ total if $\mathcal{Z} = \{0,1\}^N \times \{0,1\}^M$, and partial otherwise.

We are interested in the minimum number of bits $k$ that Alice needs to send to Bob, for Bob to be able to evaluate $f(x,y)$ for any input pair $(x,y) \in \mathcal{Z}$. We consider three models of communication: deterministic, randomized, and quantum. In the deterministic model, Alice sends Bob a $k$-bit string $a_x$ depending only on $x$. Then Bob, using only $a_x$ and $y$, must output $f(x,y)$ with certainty. In the randomized model, Alice sends Bob a $k$-bit string $a$ drawn from a probability distribution $\mathcal{D}_x$. Then Bob must output $f(x,y)$ with probability at least $2/3$ over $a \in \mathcal{D}_x$.$^{9}$ In the quantum model, Alice sends Bob a $k$-qubit mixed state $\rho_x$. Then Bob, after measuring $\rho_x$ in a basis depending on $y$, must output $f(x,y)$ with probability at least $2/3$. We use $D^1(f)$, $R^1(f)$, and $Q^1(f)$ to denote the minimum value of $k$ for which Bob can succeed in the deterministic, randomized, and quantum models respectively. Clearly $D^1(f) \ge R^1(f) \ge Q^1(f)$ for all $f$.

The question that interests us is how small the quantum communication complexity $Q^1(f)$ can be compared to the classical complexities $D^1(f)$ and $R^1(f)$. We know that there exists a total function $f : \{0,1\}^N \times \{0,1\}^N \to \{0,1\}$ for which $D^1(f) = N$ but $R^1(f) = Q^1(f) = O(\log N)$.$^{10}$ Furthermore, Gavinsky et al. [17] have recently shown that there exists a partial function $f$ for which $R^1(f) = \Omega(\sqrt{N})$ but $Q^1(f) = O(\log N)$.

On the other hand, it follows from a result of Klauck [23] that $D^1(f) = O(M Q^1(f))$ for all total $f$. Intuitively, if Bob's input is small, then quantum communication provides at most a limited savings over classical communication. But does the $D^1(f) = O(M Q^1(f))$ bound hold for partial $f$ as well? Aaronson [1] proved a slightly weaker result: for all $f$ (partial or total), $D^1(f) = O(M Q^1(f) \log Q^1(f))$. Whether the $\log Q^1(f)$ factor can be removed has remained an open problem for several years.

Using our quantum learning theorem, we are able to resolve this problem, at the cost of replacing $D^1(f)$ by $R^1(f)$. In particular, Theorem 1.4, proved in Appendix 11, shows that $R^1(f) = O(M Q^1(f))$ for any Boolean function $f$. Also, Appendix 6 uses a recent result of Gavinsky et al. [17] to show that Theorem 1.4 is close to optimal, and in particular, that it cannot be improved to $R^1(f) = O(M + Q^1(f))$.



4 Open Problems 

Perhaps the central question left open by this paper is which classes of states and measurements can be 
learned, not only with a linear number of measurements, but also with a reasonable amount of computation. 

9 We can assume without loss of generality that Bob is deterministic, i.e. that his output is a function of $a$ and $y$.
10 This $f$ is the equality function: $f(x,y) = 1$ if $x = y$, and $f(x,y) = 0$ otherwise.
To give two examples, what is the situation for stabilizer states [3] or noninteracting-fermion states [5U]?
On the experimental side, it would be interesting to demonstrate "pretty good tomography" in photonics, 
ion traps, NMR, or any other technology that allows the preparation and measurement of multi-qubit 
entangled states. Already for three or four qubits, complete tomography requires hundreds of measurements, 
and depending on what accuracy is needed, it seems likely that our learning approach could yield an efficiency 
improvement. How much of an improvement partly depends on how far our learning results can be improved, 
as well as on what the constant factors are. A related issue is that, while one can always reduce noisy, 
$k$-outcome measurements to the noiseless, two-outcome measurements that we consider, one could almost
certainly prove better upper bounds by analyzing realistic measurements more directly. 

One might hope for a far-reaching generalization of our learning theorem, to what is known as quantum 
process tomography. Here the goal is to learn an unknown quantum operation on n qubits by feeding it 
inputs and examining the outputs. But for process tomography, it is not hard to show that exponentially 
many measurements really are needed; in other words, the analogue of our learning theorem is false. Still,
it would be interesting to know if there is anything to say about "pretty good process tomography" for 
restricted classes of operations. 

Finally, our quantum information results immediately suggest several problems. First, does BQP/qpoly = YQP/poly? In other words, can we use classical advice to verify quantum advice even in the worst-case setting? Alternatively, can we give a "quantum oracle" (see [5]) relative to which BQP/qpoly $\ne$ YQP/poly? Second, can the relation $R^1(f) = O(M Q^1(f))$ be improved to $D^1(f) = O(M Q^1(f))$ for all $f$? Perhaps learning theory techniques could even shed light on the old problem of whether $R^1(f) = O(Q^1(f))$ for all total $f$.

6 Appendix: Optimality of Theorem 1.4

It is easy to see that, in Theorem 1.4, the upper bound on $R^1(f)$ needs to depend both on $M$ and on $Q^1(f)$. For the index function$^{13}$ yields a total $f$ for which $R^1(f)$ is exponentially larger than $M$, while the recent results of Gavinsky et al. [17] yield a partial $f$ for which $R^1(f)$ is exponentially larger than $Q^1(f)$. However, is it possible that Theorem 1.4 could be improved to $R^1(f) = O(M + Q^1(f))$?

Using a slight generalization of Gavinsky et al.'s result, we are able to rule out this possibility. Gavinsky et al. consider the following one-way communication problem, called the Boolean Hidden Matching Problem. Alice is given a string $x \in \{0,1\}^N$. For some parameter $\alpha > 0$, Bob is given $\alpha N$ disjoint edges $(i_1, j_1), \ldots, (i_{\alpha N}, j_{\alpha N})$ in $\{1, \ldots, N\}^2$, together with a string $w \in \{0,1\}^{\alpha N}$. (Thus Bob's input length is $M = O(\alpha N \log N)$.) Alice and Bob are promised that either

(i) $x_{i_\ell} \oplus x_{j_\ell} \equiv w_\ell \pmod 2$ for all $\ell \in \{1, \ldots, \alpha N\}$, or

(ii) $x_{i_\ell} \oplus x_{j_\ell} \not\equiv w_\ell \pmod 2$ for all $\ell \in \{1, \ldots, \alpha N\}$.

Bob's goal is to output $f = 0$ in case (i), or $f = 1$ in case (ii).

It is not hard to see that $Q^1(f) = O\left(\frac{1}{\alpha} \log N\right)$ for all $\alpha > 0$.$^{14}$ What Gavinsky et al. showed is that, if $\alpha \approx 1/\sqrt{\log N}$, then $R^1(f) = \Omega(\sqrt{N/\alpha})$. By tweaking their proof a bit, one can generalize their result to $R^1(f) = \Omega(\sqrt{N/\alpha})$ for all $\alpha \le 1/\sqrt{\log N}$.$^{15}$ So in particular, set $\alpha := 1/\sqrt{N}$. Then we obtain a partial Boolean function $f$ for which $M = O(\sqrt{N} \log N)$ and $Q^1(f) = O(\sqrt{N} \log N)$ but $R^1(f) = \Omega(N^{3/4})$, thereby refuting the conjecture that $R^1(f) = O(M + Q^1(f))$.
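
The parameter choice $\alpha = 1/\sqrt{N}$ can be checked numerically. The Python sketch below plugs the leading-order scalings from this appendix into a comparison of $R^1(f)$ against $M + Q^1(f)$. It is only an illustration: all hidden constants are set to 1, logarithms are taken base 2, and the function names are hypothetical.

```python
import math


def bhm_scalings(N: int, alpha: float) -> dict:
    """Leading-order scalings for the Boolean Hidden Matching Problem.

    Constants hidden by O(.) and Omega(.) are set to 1 (an assumption).
    """
    log_n = math.log2(N)
    return {
        "M": alpha * N * log_n,      # Bob's input length, O(alpha N log N)
        "Q1": log_n / alpha,         # quantum cost, O((1/alpha) log N)
        "R1": math.sqrt(N / alpha),  # randomized bound, Omega(sqrt(N/alpha))
    }


def ratio_at_sqrt_choice(N: int) -> float:
    """R1 / (M + Q1) at alpha = 1/sqrt(N), i.e. N^(1/4) / (2 log2 N)."""
    s = bhm_scalings(N, 1.0 / math.sqrt(N))
    return s["R1"] / (s["M"] + s["Q1"])


if __name__ == "__main__":
    for exponent in (20, 40, 80):
        print(exponent, ratio_at_sqrt_choice(2 ** exponent))
```

Under these assumptions the ratio equals $N^{1/4}/(2\log_2 N)$, which is unbounded as $N$ grows, in line with the claim that $R^1(f) = O(M + Q^1(f))$ cannot hold.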

As a final remark, the Boolean Hidden Matching Problem clearly satisfies $D^1(f) = \Omega(N)$ for all $\alpha > 0$. So by varying $\alpha$, we immediately get not only that $D^1(f) = O(M + Q^1(f))$ is false, but that Aaronson's bound $D^1(f) = O(M Q^1(f) \log Q^1(f))$ [1] is tight up to a polylogarithmic term. This answers one of the open questions in [1].

13 This is the function $f : \{0,1\}^N \times \{1, \ldots, N\} \to \{0,1\}$ defined by $f(x_1 \cdots x_N, i) = x_i$.

14 The protocol is as follows: first Alice sends the $\log N$-qubit quantum message $\frac{1}{\sqrt{N}} \sum_{i=1}^N (-1)^{x_i} |i\rangle$. Then Bob measures in a basis corresponding to $(i_1, j_1), \ldots, (i_{\alpha N}, j_{\alpha N})$. With probability $2\alpha$, Bob will learn whether $x_{i_\ell} \oplus x_{j_\ell} \equiv w_\ell$ for some edge $(i_\ell, j_\ell)$. So it suffices to amplify the protocol $O(1/\alpha)$ times.

15 R. de Wolf, personal communication.

7 Appendix: Application to Quantum Advice



Having applied our quantum learning theorem to communication complexity, in this appendix we apply the 
theorem to computational complexity. In particular, we will show how to use a trusted classical string to 
perform approximate verification of an untrusted quantum state. 

The following conventions will be helpful throughout the section. We identify a language $L \subseteq \{0,1\}^*$ with the Boolean function $L : \{0,1\}^* \to \{0,1\}$ such that $L(x) = 1$ if and only if $x \in L$. Given a quantum algorithm $A$, we let $P_A^1(|\psi\rangle)$ be the probability that $A$ accepts and $P_A^0(|\psi\rangle)$ be the probability that $A$ rejects if given the state $|\psi\rangle$ as input. Note that $A$ might neither accept nor reject (in other words, output "don't know"), in which case $P_A^0(|\psi\rangle) + P_A^1(|\psi\rangle) < 1$. Finally, we use $\mathcal{H}_2^{\otimes k}$ to denote a Hilbert space of $k$ qubits, and $\operatorname{poly}(n)$ to denote an arbitrary polynomial in $n$.

7.1 Quantum Advice and Proofs 

Recall that BQP, or Bounded-Error Quantum Polynomial-Time, is the class of problems efficiently solvable 
by a quantum computer. Then BQP/qpoly is a generalization of BQP, in which the quantum computer is 
given a polynomial-size "quantum advice state" that depends only on the input length n, but could otherwise 
be arbitrarily hard to prepare. More formally: 

Definition 7.1 A language $L \subseteq \{0,1\}^*$ is in BQP/qpoly if there exists a polynomial-time quantum algorithm $A$ such that for all input lengths $n$, there exists a quantum advice state $|\psi_n\rangle \in \mathcal{H}_2^{\otimes \operatorname{poly}(n)}$ such that $P_A^{L(x)}(|x\rangle |\psi_n\rangle) \ge 2/3$ for all $x \in \{0,1\}^n$.

How powerful is this class? Aaronson [1] proved the first limitation on BQP/qpoly, by showing that BQP/qpoly $\subseteq$ PostBQP/poly. Here PostBQP is a generalization of BQP in which we can "postselect" on the outcomes of measurements,$^{16}$ and /poly means "with polynomial-size classical advice." Intuitively, this
result means that anything we can do with quantum advice, we can also do with classical advice, provided 
we are willing to use exponentially more computation time to extract what the advice is telling us. 

In addition to quantum advice, we will also be interested in quantum proofs. Compared to advice, a proof has the advantage that it can be tailored to a particular input $x$, but the disadvantage that it cannot be trusted. In other words, while an advisor's only goal is to help the algorithm $A$ decide whether $x \in L$, a prover wants to convince $A$ that $x \in L$. The class of problems that admit polynomial-size quantum proofs is called QMA (Quantum Merlin-Arthur).

Definition 7.2 A language $L$ is in QMA if there exists a polynomial-time quantum algorithm $A$ such that for all $x \in \{0,1\}^n$:

(i) If $x \in L$ then there exists a quantum witness $|\varphi\rangle \in \mathcal{H}_2^{\otimes \operatorname{poly}(n)}$ such that $P_A^1(|x\rangle |\varphi\rangle) \ge 2/3$.
(ii) If $x \notin L$ then $P_A^1(|x\rangle |\varphi\rangle) \le 1/3$ for all $|\varphi\rangle$.

One can think of QMA as a quantum analogue of NP.

7.2 Untrusted Advice 

To state our result in the strongest possible way, we need to define a new notion called untrusted advice, which might be of independent interest for complexity theory. Intuitively, untrusted advice is a "hybrid" of proof and advice: it is like a proof in that it cannot be trusted, but like advice in that it depends only on the input length $n$. More concretely, let us define the complexity class YP, or "Yoda Polynomial-Time," to consist of all problems solvable in classical polynomial time with help from polynomial-size untrusted advice.$^{17}$

16 See [2] for a detailed definition, as well as a proof that PostBQP coincides with the classical complexity class PP. 

17 Here Yoda, from Star Wars, is intended to evoke a sage whose messages are highly generic ("Do or do not... there is no 
try"). One motivation for the name YP is that, to our knowledge, there had previously been no complexity class starting with 
a 'Y'.

[Figure 1: diagram of containments among PSPACE/poly, PostBQP/poly, QMA/qpoly, BQP/qpoly, QMA/poly, YQP/poly, QMA, BQP/poly, YQP, and BQP.]

Figure 1: Some quantum advice and proof classes. The containment BQP/qpoly $\subseteq$ PostBQP/poly was shown in [1], while QMA/qpoly $\subseteq$ PSPACE/poly was shown in [3].

Definition 7.3 A language $L$ is in YP if there exists a polynomial-time algorithm $A$ such that for all $n$:

(i) There exists a string $y_n \in \{0,1\}^{p(n)}$ such that $A(x, y_n)$ outputs $L(x)$ for all $x \in \{0,1\}^n$.
(ii) $A(x, y)$ outputs either $L(x)$ or "don't know" for all $x \in \{0,1\}^n$ and all $y$.

From the definition, it is clear that YP is contained both in P/poly and in NP $\cap$ coNP. Indeed, while we are at it, let us initiate the study of YP, by mentioning four simple facts that relate YP to standard complexity classes.

Theorem 7.4

(i) ZPP $\subseteq$ YP.

(ii) YE = NE $\cap$ coNE, where YE is the exponential-time analogue of YP (i.e., both the advice size and the verifier's running time are $2^{O(n)}$).

(iii) If P = YP then E = NE $\cap$ coNE.

(iv) If E = NE$^{\mathrm{NP}^{\mathrm{NP}}}$ then P = YP.

Naturally one can also define YPP and YQP, the (bounded-error) probabilistic and quantum analogues 
of YP. For brevity, we give only the definition of YQP. 

Definition 7.5 A language $L$ is in YQP if there exists a polynomial-time quantum algorithm $A$ such that for all $n$:

(i) There exists a state $|\varphi_n\rangle \in \mathcal{H}_2^{\otimes \operatorname{poly}(n)}$ such that $P_A^{L(x)}(|x\rangle |\varphi_n\rangle) \ge 2/3$ for all $x \in \{0,1\}^n$.

(ii) $P_A^{1-L(x)}(|x\rangle |\varphi\rangle) \le 1/3$ for all $x \in \{0,1\}^n$ and all $|\varphi\rangle$.

By analogy to the classical case, YQP is contained both in BQP/qpoly and in QMA $\cap$ coQMA. We also have YQP/qpoly = BQP/qpoly, since the untrusted YQP advice can be tacked onto the trusted /qpoly advice. Figure 1 shows the known containments among various classes involving quantum advice and proofs.

7.3 Heuristic Complexity



Ideally, we would like to show that BQP/qpoly = YQP/poly: in other words, that trusted quantum advice can be replaced by trusted classical advice together with untrusted quantum advice. However, we will only be able to prove this for the heuristic versions of these classes: that is, the versions where we allow algorithms that can err on some fraction of inputs.$^{18}$ We now explain what this means (for details, see the excellent survey by Bogdanov and Trevisan [12]).

A distributional problem is a pair $(L, \{\mathcal{D}_n\})$, where $L \subseteq \{0,1\}^*$ is a language and $\mathcal{D}_n$ is a probability distribution over $\{0,1\}^n$. Intuitively, for each input length $n$, the goal will be to decide whether $x \in L$ with high probability over $x$ drawn from $\mathcal{D}_n$. In particular, the class HeurP, or Heuristic-P, consists (roughly speaking) of all distributional problems that can be solved in polynomial time on a $1 - \frac{1}{\operatorname{poly}(n)}$ fraction of inputs.

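Definition 7.6 A distributional problem $(L, \{\mathcal{D}_n\})$ is in HeurP if there exists a polynomial-time algorithm $A$ such that for all $n$ and $\varepsilon > 0$:

$$\Pr_{x \in \mathcal{D}_n} \left[ A\left(x, 0^{\lceil 1/\varepsilon \rceil}\right) \text{ outputs } L(x) \right] \ge 1 - \varepsilon.$$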



One can also define HeurP/poly, or HeurP with polynomial-size advice. (Note that in this context, "polynomial-size" means polynomial not just in $n$ but in $1/\varepsilon$ as well.) Finally, let us define the heuristic analogues of BQP and YQP.

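Definition 7.7 A distributional problem $(L, \{\mathcal{D}_n\})$ is in HeurBQP if there exists a polynomial-time quantum algorithm $A$ such that for all $n$ and $\varepsilon > 0$:

$$\Pr_{x \in \mathcal{D}_n} \left[ P_A^{L(x)}\left( |x\rangle |0\rangle^{\otimes \lceil 1/\varepsilon \rceil} \right) \ge \frac{2}{3} \right] \ge 1 - \varepsilon.$$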



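Definition 7.8 A distributional problem $(L, \{\mathcal{D}_n\})$ is in HeurYQP if there exists a polynomial-time quantum algorithm $A$ such that for all $n$ and $\varepsilon > 0$:

(i) There exists a state $|\varphi_{n,\varepsilon}\rangle \in \mathcal{H}_2^{\otimes \operatorname{poly}(n, 1/\varepsilon)}$ such that

$$\Pr_{x \in \mathcal{D}_n} \left[ P_A^{L(x)}\left( |x\rangle |\varphi_{n,\varepsilon}\rangle \right) \ge \frac{2}{3} \right] \ge 1 - \varepsilon.$$

(ii) The probability over $x \in \mathcal{D}_n$ that there exists a $|\varphi\rangle$ such that $P_A^{1-L(x)}(|x\rangle |\varphi\rangle) \ge 1/3$ is at most $\varepsilon$.

It is clear that HeurYQP/poly $\subseteq$ HeurBQP/qpoly = HeurYQP/qpoly.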



7.4 Overview of Proof 

Our goal is to show that HeurBQP/qpoly = HeurYQP/poly: in the heuristic setting, trusted classical advice can be used to verify untrusted quantum advice. The intuition behind this result is simple: the classical advice to the HeurYQP verifier $V$ will consist of a polynomial number of randomly-chosen "test inputs" $x_1, \ldots, x_m$, as well as whether each $x_i$ belongs to the language $L$. Then given an untrusted quantum advice state $|\varphi\rangle$, first $V$ will check that $|\varphi\rangle$ yields the correct answers on $x_1, \ldots, x_m$; only if $|\varphi\rangle$ passes this initial test will $V$ use it on the input $x$ of interest. By appealing to our quantum learning theorem, we will argue that any $|\varphi\rangle$ that passes the initial test must yield the correct answers for most $x$ with high probability.

But there is a problem: what if a dishonest prover sends a state $|\varphi\rangle$ such that, while $V$'s measurements succeed in "verifying" $|\varphi\rangle$, they also corrupt it? Indeed, even if $V$ repeats the verification procedure many

18 Closely related to heuristic complexity is the better-known average-case complexity. In average-case complexity one considers algorithms that can never err, but that are allowed to output "don't know" on some fraction of inputs.
times, conceivably $|\varphi\rangle$ could be corrupted by the very last repetition without $V$ ever realizing it. Intuitively, the easiest way to avoid this problem is just to repeat the verification procedure a random number of times. To formalize this intuition, we need the following "quantum union bound," which was proved by Aaronson [3] based on a result of Ambainis et al. [7].

Proposition 7.9 (Aaronson [3]) Let $E_1, \ldots, E_m$ be two-outcome measurements, and suppose $\operatorname{Tr}(E_i \rho) \ge 1 - \varepsilon$ for all $i \in \{1, \ldots, m\}$. Then if we apply $E_1, \ldots, E_m$ in sequence to the initial state $\rho$, the probability that any of the $E_i$'s reject is at most $m\sqrt{\varepsilon}$.
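
To see what Proposition 7.9 costs in practice, the short Python sketch below evaluates the bound $m\sqrt{\varepsilon}$ and inverts it. The function names are illustrative only, and capping the bound at 1 is a convenience added here rather than part of the proposition.

```python
import math


def union_bound(m: int, eps: float) -> float:
    """Bound m * sqrt(eps) on the chance that any of m sequential
    measurements rejects, capped at 1 since it bounds a probability."""
    return min(1.0, m * math.sqrt(eps))


def eps_needed(m: int, target: float) -> float:
    """Largest per-measurement error eps with m * sqrt(eps) <= target."""
    return (target / m) ** 2


if __name__ == "__main__":
    # With 100 measurements, a 10% overall failure bound needs eps = 1e-6.
    print(union_bound(100, 1e-6), eps_needed(100, 0.1))
```

The square root means the per-measurement error must shrink quadratically in $m$: keeping the overall bound at a fixed level requires $\varepsilon$ on the order of $1/m^2$, whereas a classical union bound over independent tests would only need $\varepsilon$ on the order of $1/m$.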

Using Proposition 7.9, we can prove the following "Witness Protection Lemma."

Lemma 7.10 (Witness Protection Lemma) Let $\mathcal{E} = \{E_1, \ldots, E_m\}$ be a set of two-outcome measurements, and let $T$ be a positive integer. Then there exists a test procedure $Q$ with the following properties:

(i) $Q$ takes a state $\rho_0$ as input, applies at most $T$ measurements from $\mathcal{E}$, and then returns either "success" or "failure."

(ii) If $\operatorname{Tr}(E_i \rho_0) \ge 1 - \varepsilon$ for all $i$, then $Q$ succeeds with probability at least $1 - T\sqrt{\varepsilon}$.

(iii) If $Q$ succeeds with probability at least $\lambda$, then conditioned on succeeding, $Q$ outputs a state $\sigma$ such that
Tr (E t a) > 1 - for all i. 

Finally, by using Lemma 7.10, we can prove Theorem 1.5: that HeurBQP/qpoly = HeurYQP/poly $\subseteq$ HeurQMA/poly.

8 Appendix: Learning from Measurement Results 

In Section 2 we considered a model where for each measurement $E$, the learner is told the approximate value of $\operatorname{Tr}(E\rho)$. This model suffices for our applications to quantum computing. But for other applications, it might be natural to ask what happens if we instead assume that for each $E$, the learner is merely given a measurement outcome: that is, a bit that is 1 with probability $\operatorname{Tr}(E\rho)$ and 0 with probability $1 - \operatorname{Tr}(E\rho)$. Of course, if the learner were given many such measurement outcomes for the same $E$, it could form an estimate of $\operatorname{Tr}(E\rho)$. But we are assuming that for each $E$, the learner only receives one measurement outcome.

We will show that, even in this seemingly weak model, an $n$-qubit quantum state can still be learned using $O(n)$ measurements, although the dependence on the parameters $\gamma$ and $\varepsilon$ will worsen.

The general task we are considering — that of learning a p-concept given only samples from its associated 
probability distribution — is called learning in the p-concept model. The task was first studied in the early 
1990's by Kearns and Schapire [22], who left open whether it can always be done if the fat-shattering
dimension is finite. Alon et al. [6] answered this question affirmatively in a breakthrough a few years later. 
Unfortunately, Alon et al. never worked out the actual complexity bound implied by their result, and to the 
best of our knowledge no one else did either. Thus, our first task will be to fill this rather large gap in the 
literature. We will do so using Theorem 2.3, which was proven by Bartlett and Long [9] building on ideas
of Alon et al. 

Theorem 8.1 Let $\mathcal{S}$ be a sample space, let $\mathcal{C}$ be a p-concept class over $\mathcal{S}$, and let $\mathcal{D}$ be a probability measure over $\mathcal{S}$. Fix a p-concept $F \in \mathcal{C}$, as well as error parameters $\varepsilon, \gamma > 0$. Suppose we are given $m$ samples $X = (x_1, \ldots, x_m)$ drawn independently from $\mathcal{D}$, as well as bits $B = (b_1, \ldots, b_m)$ such that each $b_i$ is 1 with independent probability $F(x_i)$. Suppose also that we choose a hypothesis $H \in \mathcal{C}$ to minimize the quadratic functional $\sum_{i=1}^m (H(x_i) - b_i)^2$. Then there exists a positive constant $K$ such that

$$\Pr_{x \in \mathcal{D}} \left[ |H(x) - F(x)| > \gamma \right] \le \varepsilon$$

with probability at least $1 - \delta$ over $X$ and $B$, provided that

$$m \ge \frac{K}{\gamma^4 \varepsilon^2} \left( \operatorname{fat}_{\mathcal{C}}\left(\frac{\gamma^2 \varepsilon}{10}\right) \log^2 \frac{1}{\gamma\varepsilon} + \log \frac{1}{\delta} \right).$$



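We can now prove Theorem 1.3: that

$$m = O\left( \frac{1}{\gamma^4 \varepsilon^2} \left( \frac{n}{\gamma^4 \varepsilon^2} \log^2 \frac{1}{\gamma\varepsilon} + \log \frac{1}{\delta} \right) \right)$$

measurements suffice to learn an $n$-qubit quantum state in the p-concept model. As in Section 2, let $\mathcal{C}_n$ be the class of functions $f_\rho : \mathcal{S} \to [0,1]$ defined by $f_\rho(E) = \operatorname{Tr}(E\rho)$. Then the theorem follows immediately from Theorem 8.1, together with the fact that

$$\operatorname{fat}_{\mathcal{C}_n}\left(\frac{\gamma^2 \varepsilon}{10}\right) = O\left( \frac{n}{(\gamma^2 \varepsilon / 10)^2} \right) = O\left( \frac{n}{\gamma^4 \varepsilon^2} \right)$$

by Corollary 2.7.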
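
The hypothesis-selection rule of Theorem 8.1 can be illustrated on a toy problem. The Python sketch below is not the paper's procedure for quantum states: it assumes a small finite hypothesis class (linear p-concepts $H_t(x) = tx$ on $[0,1]$, chosen purely for illustration) and picks the member minimizing the quadratic functional $\sum_i (H(x_i) - b_i)^2$ from one random bit per sample.

```python
import random


def least_squares_hypothesis(hypotheses, samples, bits):
    """Return the hypothesis H minimizing sum_i (H(x_i) - b_i)^2."""
    def loss(h):
        return sum((h(x) - b) ** 2 for x, b in zip(samples, bits))
    return min(hypotheses, key=loss)


def demo(seed: int = 0, m: int = 2000) -> float:
    """Recover the slope of a hidden p-concept from single-bit outcomes."""
    rng = random.Random(seed)
    # Hypothetical toy class: H_t(x) = t * x for t in {0.0, 0.1, ..., 1.0}.
    grid = [t / 10 for t in range(11)]
    hypotheses = [(lambda x, t=t: t * x) for t in grid]
    target = hypotheses[7]  # the hidden p-concept F(x) = 0.7 x
    samples = [rng.random() for _ in range(m)]
    # Each bit b_i is 1 with probability F(x_i), as in the theorem.
    bits = [1 if rng.random() < target(x) else 0 for x in samples]
    best = least_squares_hypothesis(hypotheses, samples, bits)
    return best(1.0)  # slope of the selected hypothesis


if __name__ == "__main__":
    print(demo())
```

The quadratic loss is a natural choice here because $\operatorname{EX}[(H(x) - b)^2] = (H(x) - F(x))^2 + F(x)(1 - F(x))$ for a bit $b$ with mean $F(x)$, so in expectation the functional is minimized by $F$ itself.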

Given a state $\rho$, Theorem 1.3 upper-bounds the number of measurements needed to estimate the measurement probabilities $\operatorname{Tr}(E\rho)$. Can we do better if, instead of estimating the probabilities, we merely want to predict the outcomes themselves with nontrivial bias? In Appendix 10, we will prove an almost-tight variant of Theorem 1.3 that is optimized for this prediction task.



9 Appendix: Lower Bounds 

Having proved upper bounds on the measurement complexity of quantum learning, in this appendix we turn 
to lower bounds. Roughly speaking, we will show that there exists a measurement distribution T> for which 

$$m = \Omega\left( \frac{1}{\varepsilon} \left( \frac{n}{\gamma^2} + \log \frac{1}{\delta} \right) \right)$$



measurements are necessary to learn an $n$-qubit state. Also, in the model of Appendix 8 (the model where each measurement is applied only once) we will show that

$$m = \Omega\left( \frac{1}{\varepsilon} \left( \frac{n}{\gamma^4} + \log \frac{1}{\delta} \right) \right)$$

measurements are necessary.$^{19}$ In particular, this means that Theorem 1.1 is tight in its dependence on $n$, while Theorem 1.2 is tight up to a multiplicative factor of $\log^2(n/\gamma)$.
More formally: 

Theorem 9.1 Fix an integer $n > 0$ and error parameters $\varepsilon, \delta, \gamma \in (0,1)$. Then there exists a distribution $\mathcal{D}$ over $n$-qubit measurements for which the following holds.

(i) Suppose 

$$m = o\left( \frac{1}{\varepsilon} \left( \frac{n}{\gamma^2} + \log \frac{1}{\delta} \right) \right).$$



19 Anthony and Bartlett [8] proved a generic lower bound on sample complexity in terms of fat-shattering dimension. However, their bound only implies that

$$m = \Omega\left( \frac{1}{\varepsilon} \left( \frac{n/\gamma^2}{\log^2(n/\gamma^2)} + \log \frac{1}{\delta} \right) \right)$$

measurements are necessary. Using an argument more tailored to our problem, we were able to get rid of the $\log^2(n/\gamma^2)$ factor, improving the dependence on $\operatorname{fat}_{\mathcal{C}_n}(\gamma) \sim n/\gamma^2$ to linear.
Then there is no learning algorithm that, given measurements $\mathcal{E} = (E_1, \ldots, E_m)$ drawn independently from $\mathcal{D}$, as well as real numbers $p_1, \ldots, p_m$ such that $|p_i - \operatorname{Tr}(E_i \rho)| \le \gamma^2/n$ for all $i$, outputs a hypothesis state $\sigma$ such that

$$\Pr_{E \in \mathcal{D}} \left[ |\operatorname{Tr}(E\sigma) - \operatorname{Tr}(E\rho)| > \gamma \right] \le \varepsilon$$

with probability at least $1 - \delta$ over $\mathcal{E}$.
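(ii) Suppose

$$m = o\left( \frac{1}{\varepsilon} \left( \frac{n}{\gamma^4} + \log \frac{1}{\delta} \right) \right).$$

Then there is no learning algorithm that, given measurements $\mathcal{E} = (E_1, \ldots, E_m)$ drawn independently from $\mathcal{D}$, as well as bits $B = (b_1, \ldots, b_m)$ where each $b_i$ is 1 with independent probability $\operatorname{Tr}(E_i \rho)$, outputs a hypothesis state $\sigma$ such that

$$\Pr_{E \in \mathcal{D}} \left[ |\operatorname{Tr}(E\sigma) - \operatorname{Tr}(E\rho)| > \gamma \right] \le \varepsilon$$

with probability at least $1 - \delta$ over $\mathcal{E}$ and $B$.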

To prove Theorem 9.1, it will be helpful to introduce a new parameter that we call the fine-shattering dimension. This parameter is like the fat-shattering dimension but with additional restrictions.

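Definition 9.2 Let $\mathcal{S}$ be a sample space, let $\mathcal{C}$ be a p-concept class over $\mathcal{S}$, and let $0 < \gamma < 1/2$ and $\eta > 0$ be real numbers. We say a set $\{s_1, \ldots, s_k\} \subseteq \mathcal{S}$ is $(\gamma, \eta)$-fine-shattered by $\mathcal{C}$ if for all $B \subseteq \{1, \ldots, k\}$, there exists a p-concept $F \in \mathcal{C}$ such that for all $i \in \{1, \ldots, k\}$,

(i) if $i \notin B$ then $\frac{1}{2} - \gamma - \eta \le F(s_i) \le \frac{1}{2} - \gamma$, and
(ii) if $i \in B$ then $\frac{1}{2} + \gamma \le F(s_i) \le \frac{1}{2} + \gamma + \eta$.

Then the $(\gamma, \eta)$-fine-shattering dimension of $\mathcal{C}$, or $\operatorname{fine}_{\mathcal{C}}(\gamma, \eta)$, is the maximum $k$ such that some subset $\{s_1, \ldots, s_k\}$ of $\mathcal{S}$ is $(\gamma, \eta)$-fine-shattered by $\mathcal{C}$.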
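
For a finite sample space and a finite concept class, Definition 9.2 can be checked by brute force. The Python sketch below is a direct, exponential-time transcription of the definition, intended only to make the two windows around $1/2$ concrete; the function names and the finite-class setting are assumptions made for illustration.

```python
from itertools import combinations, product


def is_fine_shattered(points, concepts, gamma, eta):
    """Check Definition 9.2 for a tuple of points and a finite concept list."""
    lo0, hi0 = 0.5 - gamma - eta, 0.5 - gamma  # window when i is not in B
    lo1, hi1 = 0.5 + gamma, 0.5 + gamma + eta  # window when i is in B
    # Each 0/1 pattern encodes a subset B of the point indices.
    for pattern in product((0, 1), repeat=len(points)):
        realized = any(
            all(
                (lo1 <= f(s) <= hi1) if bit else (lo0 <= f(s) <= hi0)
                for s, bit in zip(points, pattern)
            )
            for f in concepts
        )
        if not realized:
            return False
    return True


def fine_shattering_dimension(space, concepts, gamma, eta):
    """Largest k such that some k-subset of a finite space is fine-shattered."""
    best = 0
    for k in range(1, len(space) + 1):
        if any(is_fine_shattered(c, concepts, gamma, eta)
               for c in combinations(space, k)):
            best = k
    return best
```

Unlike fat-shattering, the thresholds here are pinned to $1/2$ and the values are confined to windows of width $\eta$, which is why a fine-shattered set is always fat-shattered but not conversely.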

Clearly $\operatorname{fine}_{\mathcal{C}}(\gamma, \eta) \le \operatorname{fat}_{\mathcal{C}}(\gamma)$ for all $\mathcal{C}$, $\eta$, and $\gamma$. The following theorem lower-bounds sample complexity in terms of fine-shattering dimension. The proof builds on standard lower-bound arguments in computational learning theory, such as that of Ehrenfeucht et al. [16].

Theorem 9.3 Let $\mathcal{S}$ be a sample space, let $\mathcal{C}$ be a p-concept class over $\mathcal{S}$, and let $\varepsilon, \delta, \gamma, \eta > 0$. Then
provided finec (7; if) ^ 2 and e, S < j, there exists a distribution T> over S for which the following holds. 

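(i) Suppose

$$m \le \max\left\{ \frac{\operatorname{fine}_{\mathcal{C}}(\gamma, \eta) - 1}{64\varepsilon},\ \frac{1}{4\varepsilon} \ln \frac{1}{2\delta} \right\},$$

and let $F \in \mathcal{C}$. Then there is no learning algorithm that, given samples $X = (x_1, \ldots, x_m)$ drawn independently from $\mathcal{D}$, as well as real numbers $p_1, \ldots, p_m$ such that $|p_i - F(x_i)| \le \eta$ for all $i$, outputs a hypothesis $H$ such that

$$\Pr_{x \in \mathcal{D}} \left[ |H(x) - F(x)| > \gamma \right] \le \varepsilon$$

with probability at least $1 - \delta$ over $X$.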
(ii) Suppose

$$m \le \max\left\{ \frac{\operatorname{fine}_{\mathcal{C}}(\gamma, \eta) - 1}{A (\gamma + \eta)^2 \varepsilon},\ \frac{1}{4\varepsilon} \ln \frac{1}{2\delta} \right\},$$

where $A$ is some universal constant, and let $F \in \mathcal{C}$. Then there is no learning algorithm that, given samples $X = (x_1, \ldots, x_m)$ drawn independently from $\mathcal{D}$, as well as bits $b_1, \ldots, b_m$ such that $\Pr[b_i = 1] = F(x_i)$, outputs a hypothesis $H$ such that

$$\Pr_{x \in \mathcal{D}} \left[ |H(x) - F(x)| > \gamma \right] \le \varepsilon$$

with probability at least $1 - \delta$ over $X$.

To finish the proof of Theorem 9.1, the remaining step is to lower-bound the fine-shattering dimension of $n$-qubit quantum states. We will do so using the following result of Ambainis et al. [7], which is basically the converse of Theorem 2.5.



Theorem 9.4 (Ambainis et al. [7]) Let $1/2 < p < 1$, and let $n$ and $k$ be positive integers satisfying $n \ge (1 - H(p))\, k + 7 \log_2 k$. Then there exist $n$-qubit mixed states $\{\rho_y\}_{y \in \{0,1\}^k}$ and measurements $E_1, \ldots, E_k$ such that for all $y \in \{0,1\}^k$ and $i \in \{1, \ldots, k\}$:

(i) if $y_i = 0$ then $\operatorname{Tr}(E_i \rho_y) \le 1 - p$, and

(ii) if $y_i = 1$ then $\operatorname{Tr}(E_i \rho_y) \ge p$.

Incidentally, the encoding scheme of Theorem 9.4 is completely classical, in the sense that the $\rho_y$'s and $E_i$'s are both diagonal in the computational basis. However, we find it more convenient to state the result in quantum language.

We will actually need a slight extension of Theorem 9.4, which bounds the $\operatorname{Tr}(E_i \rho_y)$'s on both sides rather than only one.

Theorem 9.5 Let ^ < p < 1, let r\ > 0, and let n and k be positive integers satisfying k < -| and n > 
(1 — H (pj) k + 71og 2 ~. Then there exist n-qubit mixed states {Py} ye ^ Q iyk , and measurements E\, .. . , Ek, 
such that for all $y \in \{0,1\}^k$ and $i \in \{1, \ldots, k\}$:

(i) if $y_i = 0$ then $1 - p - \eta \le \operatorname{Tr}(E_i \rho_y) \le 1 - p$, and

(ii) if $y_i = 1$ then $p \le \operatorname{Tr}(E_i \rho_y) \le p + \eta$.

As in Section 2, let $\mathcal{S}$ be the set of two-outcome measurements on $n$ qubits, and let $\mathcal{C}_n$ be the class of all functions $F : \mathcal{S} \to [0,1]$ such that $F(E) = \operatorname{Tr}(E\rho)$ for some $n$-qubit mixed state $\rho$. Then Theorem 9.5 has the following corollary.



Corollary 9.6 For all positive integers n and all 7 > v/^-t™- 5 )/ 35 /^ 

( 8 7 2 \ 
nne c„ If' - ) - 



_5 7 2 

Theorem 9.1 now follows immediately by combining Theorem 9.3 with Corollary 9.6.



10 Appendix: Prediction Problems 

In this appendix we give a variant of Theorem 1.3 that is useful for prediction (as opposed to learning) problems and that, as a bonus, is nearly tight. As usual, we first give a general upper bound in terms of the fat-shattering dimension.

Theorem 10.1 Let $\mathcal{S}$ be a sample space, let $\mathcal{C}$ be a p-concept class over $\mathcal{S}$, and let $\mathcal{D}$ be a probability measure over $\mathcal{S}$. Fix a p-concept $F \in \mathcal{C}$, as well as an error parameter $\alpha > 0$. Suppose we are given $m$ samples $X = (x_1, \ldots, x_m)$ drawn independently from $\mathcal{D}$, as well as bits $B = (b_1, \ldots, b_m)$ such that each $b_i$ is 1 with independent probability $F(x_i)$. Suppose also that we choose a hypothesis $H \in \mathcal{C}$ to minimize $\sum_{i=1}^m |H(x_i) - b_i|$. Let

$$\Delta_{H,F}(x) := H(x)\,(1 - F(x)) + (1 - H(x))\, F(x).$$

Then there exists a positive constant $K$ such that

$$\operatorname{EX}_{x \in \mathcal{D}} \left[ \Delta_{H,F}(x) \right] \le \alpha + \inf_{C \in \mathcal{C}} \operatorname{EX}_{x \in \mathcal{D}} \left[ \Delta_{C,F}(x) \right]$$

with probability at least $1 - \delta$ over $X$ and $B$, provided that

$$m \ge \frac{K}{\alpha^2} \left( \operatorname{fat}_{\mathcal{C}}\left(\frac{\alpha}{5}\right) \log^2 \frac{1}{\alpha} + \log \frac{1}{\delta} \right).$$

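Theorem 10.1 has the following immediate corollary.

Corollary 10.2 Let $\rho$ be an $n$-qubit state, let $\mathcal{D}$ be a distribution over two-outcome measurements, and let $\mathcal{E} = (E_1, \ldots, E_m)$ consist of $m$ measurements drawn independently from $\mathcal{D}$. Suppose we are given bits $B = (b_1, \ldots, b_m)$, where each $b_i$ is 1 with independent probability $\operatorname{Tr}(E_i \rho)$. Suppose also that we choose a hypothesis state $\sigma$ to minimize $\sum_{i=1}^m |\operatorname{Tr}(E_i \sigma) - b_i|$. Let

$$\Delta_{\sigma,\rho}(E) := \operatorname{Tr}(E\sigma)\,(1 - \operatorname{Tr}(E\rho)) + (1 - \operatorname{Tr}(E\sigma))\, \operatorname{Tr}(E\rho).$$

Then there exists a positive constant $K$ such that

$$\operatorname{EX}_{E \in \mathcal{D}} \left[ \Delta_{\sigma,\rho}(E) \right] \le \alpha + \inf_{\varsigma} \operatorname{EX}_{E \in \mathcal{D}} \left[ \Delta_{\varsigma,\rho}(E) \right]$$

with probability at least $1 - \delta$ over $\mathcal{E}$ and $B$, provided that

$$m \ge \frac{K}{\alpha^2} \left( \frac{n}{\alpha^2} \log^2 \frac{1}{\alpha} + \log \frac{1}{\delta} \right).$$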

Let us describe a simple application of Corollary 10.2. Given a two-outcome measurement $E$ and an $n$-qubit state $\rho$, let $E(\rho) \in \{0,1\}$ be the result of applying $E$ to $\rho$: that is, $E(\rho) = 1$ with probability $\operatorname{Tr}(E\rho)$ and $E(\rho) = 0$ otherwise. Suppose our goal is to output a hypothesis state $\sigma$ that maximizes $\Pr_{E \in \mathcal{D}}[E(\sigma) = E(\rho)]$, the "average probability of agreement" between $\sigma$ and $\rho$. Corollary 10.2 shows that, by using $O\left(\frac{n}{\alpha^4} \log^2 \frac{1}{\alpha}\right)$ measurements, we can get within an additive constant $\alpha$ of the maximum with high probability.

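Similarly, suppose we are given a measurement $E$ drawn from $\mathcal{D}$, and want to guess whether $E(\rho)$ will be 0 or 1. Here the maximum success probability is $\frac{1}{2} \operatorname{EX}_{E \in \mathcal{D}} \left[ 1 + |2\operatorname{Tr}(E\rho) - 1| \right]$, and is obtained by simply guessing 1 if $\operatorname{Tr}(E\rho) > \frac{1}{2}$, or 0 if $\operatorname{Tr}(E\rho) < \frac{1}{2}$. Again, it follows from Corollary 10.2 that by using $O\left(\frac{n}{\alpha^4} \log^2 \frac{1}{\alpha}\right)$ measurements, we can get within an additive constant $\alpha$ of the maximum with high probability.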
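
The two quantities in this discussion are easy to compute for a finite list of acceptance probabilities. The Python sketch below is illustrative only: it assumes the distribution over measurements is uniform over a given list of values $p = \operatorname{Tr}(E\rho)$, and the function names are hypothetical. It checks that the threshold guess attains the stated maximum $\frac{1}{2}\operatorname{EX}[1 + |2p - 1|]$.

```python
def disagreement(h: float, f: float) -> float:
    """Delta = h(1 - f) + (1 - h)f: the chance that two independent bits
    with means h and f take different values."""
    return h * (1.0 - f) + (1.0 - h) * f


def best_guess(p: float) -> int:
    """Guess 1 if the acceptance probability exceeds 1/2, otherwise 0."""
    return 1 if p > 0.5 else 0


def max_success_probability(probs) -> float:
    """(1/2) * average of 1 + |2p - 1| over a uniform list of p values."""
    return sum(0.5 * (1.0 + abs(2.0 * p - 1.0)) for p in probs) / len(probs)


def success_of_best_guess(probs) -> float:
    """Average chance that best_guess(p) matches a bit with mean p."""
    return sum(1.0 - disagreement(float(best_guess(p)), p)
               for p in probs) / len(probs)


if __name__ == "__main__":
    probs = [0.25, 0.5, 1.0]
    print(max_success_probability(probs), success_of_best_guess(probs))
```

The agreement is no accident: for a deterministic guess $g \in \{0,1\}$, the success probability $1 - \Delta_{g,p}$ equals $p$ or $1 - p$, and $\max(p, 1-p) = \frac{1}{2}(1 + |2p - 1|)$.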

Using the same arguments as in Appendix 9, one can show that Corollary 10.2 is tight up to the $\log^2 \frac{1}{\alpha}$
term — in particular, that 




measurements are needed. We omit the details. 

What distinguishes this sort of prediction problem from the learning problems we have seen before is that, as the number of sample measurements $m$ goes to infinity, we will not necessarily converge to the "true" state $\rho$. One way to see this is that, while $\rho$ could be a mixed state, by convexity there is always a pure hypothesis state $\sigma = |\psi\rangle\langle\psi|$ that does as well at the prediction task as any other hypothesis. On the positive side, this means that to find such a hypothesis given the measurement results, it suffices to compute the principal eigenvector of a $2^n \times 2^n$ matrix. Unlike for the learning problems, here there is no need for semidefinite or convex programming.
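
One way to see where the eigenvector comes from is the following short calculation. It is a reconstruction offered as a sketch, not a derivation taken from the paper, and the matrix $M$ is notation introduced here. It assumes each $E_i$ is identified with its accepting operator, so that $0 \le \operatorname{Tr}(E_i \sigma) \le 1$, and uses the fact that each $b_i$ is 0 or 1.

```latex
\begin{align*}
\sum_{i=1}^m \left| \operatorname{Tr}(E_i \sigma) - b_i \right|
  &= \sum_{i : b_i = 0} \operatorname{Tr}(E_i \sigma)
   + \sum_{i : b_i = 1} \bigl( 1 - \operatorname{Tr}(E_i \sigma) \bigr) \\
  &= \bigl| \{ i : b_i = 1 \} \bigr| - \operatorname{Tr}(M \sigma),
  \qquad M := \sum_{i=1}^m (2 b_i - 1)\, E_i .
\end{align*}
```

Minimizing the left-hand side over density matrices $\sigma$ is therefore the same as maximizing $\operatorname{Tr}(M\sigma)$ for the Hermitian matrix $M$, and that maximum is the largest eigenvalue of $M$, attained by $\sigma = |\psi\rangle\langle\psi|$ with $|\psi\rangle$ a principal eigenvector.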
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11 Appendix: Proofs 



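Proof of Corollary 2.4. Let $S$ be a sample space, let $\mathcal{C}$ be a p-concept class over $S$, and let $\mathcal{D}$ be a
probability measure over $S$. Then let $\mathcal{C}^*$ be the class of p-concepts $G : S \to [0,1]$ for which there exists
an $F \in \mathcal{C}$ such that $|G(x) - F(x)| \le \eta$ for all $x \in S$. Also, fix a p-concept $F \in \mathcal{C}$. Suppose we draw $m$
samples $X = (x_1,\ldots,x_m)$ independently according to $\mathcal{D}$, and then choose any hypothesis $H \in \mathcal{C}$ such that
$|H(x) - F(x)| \le \eta$ for all $x \in X$. Then there exists a $G \in \mathcal{C}^*$ such that $G(x) = H(x)$ for all $x \in X$. This
$G$ is simply obtained by setting $G(x) := H(x)$ if $x \in X$ and $G(x) := F(x)$ otherwise.
So by Theorem 2.3, provided that

\[ m \ge \frac{K}{\alpha^2}\left(\operatorname{fat}_{\mathcal{C}^*}\!\left(\frac{\alpha}{5}\right)\log^2\frac{1}{\alpha} + \log\frac{1}{\delta}\right), \]

we have

\[ \operatorname{EX}_{x\in\mathcal{D}}\left[|H(x) - G(x)|\right] \le \alpha + \inf_{C\in\mathcal{C}^*}\operatorname{EX}_{x\in\mathcal{D}}\left[|C(x) - G(x)|\right] = \alpha \]

with probability at least $1-\delta$ over $X$. Here we have used the fact that $G \in \mathcal{C}^*$ and hence

\[ \inf_{C\in\mathcal{C}^*}\operatorname{EX}_{x\in\mathcal{D}}\left[|C(x) - G(x)|\right] = 0. \]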

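Setting $\alpha := \frac{6\gamma}{7}\varepsilon$, this implies by Markov's inequality that

\[ \Pr_{x\in\mathcal{D}}\left[|H(x) - G(x)| > \frac{6\gamma}{7}\right] \le \varepsilon, \]

and therefore

\[ \Pr_{x\in\mathcal{D}}\left[|H(x) - F(x)| > \frac{6\gamma}{7} + \eta\right] \le \varepsilon. \]

Since $\eta \le \frac{\gamma\varepsilon}{7} \le \frac{\gamma}{7}$, the above implies that

\[ \Pr_{x\in\mathcal{D}}\left[|H(x) - F(x)| > \gamma\right] \le \varepsilon \]

as desired.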

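Next we claim that $\operatorname{fat}_{\mathcal{C}^*}(\alpha) \le \operatorname{fat}_{\mathcal{C}}(\alpha - \eta)$. The reason is simply that, if a given set is $\alpha$-fat-shattered by $\mathcal{C}^*$,
then it must also be $(\alpha-\eta)$-fat-shattered by $\mathcal{C}$, by the triangle inequality.
Putting it all together, we have

\[ \operatorname{fat}_{\mathcal{C}^*}\!\left(\frac{\alpha}{5}\right) \le \operatorname{fat}_{\mathcal{C}}\!\left(\frac{\alpha}{5} - \eta\right) \le \operatorname{fat}_{\mathcal{C}}\!\left(\frac{6\gamma\varepsilon}{35} - \frac{\gamma\varepsilon}{7}\right) = \operatorname{fat}_{\mathcal{C}}\!\left(\frac{\gamma\varepsilon}{35}\right), \]

and hence

\[ m \ge \frac{K}{(6\gamma\varepsilon/7)^2}\left(\operatorname{fat}_{\mathcal{C}}\!\left(\frac{\gamma\varepsilon}{35}\right)\log^2\frac{1}{6\gamma\varepsilon/7} + \log\frac{1}{\delta}\right) \]

samples suffice. ■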
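The constants in this proof can be checked with exact rational arithmetic. The sketch below is only an illustration: the function name and the sample values of $\gamma$ and $\varepsilon$ are assumptions, and $\eta$ is taken at its largest allowed value $\gamma\varepsilon/7$.

```python
from fractions import Fraction


def corollary_2_4_constants(gamma, eps):
    """Return (alpha, eta_max, fat_scale) for the choices made in the proof.

    alpha     = (6*gamma/7) * eps   (error parameter fed to Theorem 2.3)
    eta_max   = gamma*eps/7         (largest eta the proof allows)
    fat_scale = alpha/5 - eta_max   (scale at which fat_C is evaluated)
    """
    alpha = Fraction(6, 7) * gamma * eps
    eta_max = gamma * eps / 7
    fat_scale = alpha / 5 - eta_max
    return alpha, eta_max, fat_scale


gamma, eps = Fraction(1, 2), Fraction(1, 10)  # illustrative values only
alpha, eta_max, fat_scale = corollary_2_4_constants(gamma, eps)
print(alpha, eta_max, fat_scale)
```

With these values the fat-shattering scale comes out to exactly $\gamma\varepsilon/35$, and $\frac{6\gamma}{7} + \eta \le \gamma$ holds, which is what the last two steps of the argument use.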

Proof of Theorem 2.6. Suppose there exists such an encoding scheme with $n/\gamma^2 = o(k)$. Then consider
an amplified scheme, where each string $y \in \{0,1\}^k$ is encoded by the tensor product state $\rho_y^{\otimes\ell}$. Here we set
$\ell := \lceil c/\gamma^2 \rceil$ for some $c > 0$. Also, for all $i \in \{1,\ldots,k\}$, let $E_i^*$ be an amplified measurement that applies $E_i$
to each of the $\ell$ copies of $\rho_y$, and accepts if and only if at least $\ell/2$ of the $E_i$'s do. Then provided we choose
$c$ sufficiently large, it is easy to show by a Chernoff bound that for all $y$ and $i$,

(i) if $y_i = 0$ then $\operatorname{Tr}(E_i^*\rho_y^{\otimes\ell}) \le \frac13$, and

(ii) if $y_i = 1$ then $\operatorname{Tr}(E_i^*\rho_y^{\otimes\ell}) \ge \frac23$.

So to avoid contradicting Theorem 2.5, we need $n\ell \ge \left(1 - H\!\left(\frac13\right)\right)k$. But this implies that $n/\gamma^2 = \Omega(k)$. [20] ■
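The amplification step can be checked numerically with an exact binomial tail rather than a Chernoff estimate. In the sketch below, the function names, the constant `c`, and the sample value of $\gamma$ are illustrative assumptions; each of the $\ell$ copies is taken to accept independently with the same probability.

```python
from math import ceil, comb


def copies_needed(gamma, c):
    """l := ceil(c / gamma^2), with c a constant the caller chooses."""
    return ceil(c / gamma ** 2)


def amplified_accept_prob(p, copies):
    """Probability that at least half of `copies` independent runs accept,
    when each run accepts with probability p."""
    threshold = ceil(copies / 2)
    return sum(
        comb(copies, j) * p ** j * (1 - p) ** (copies - j)
        for j in range(threshold, copies + 1)
    )


gamma = 0.25                          # illustrative bias
ell = copies_needed(gamma, c=8)       # c = 8 is an arbitrary choice
low = amplified_accept_prob(0.5 - gamma, ell)
high = amplified_accept_prob(0.5 + gamma, ell)
print(ell, low, high)
```

For these values the amplified acceptance probabilities fall on the correct sides of $\frac13$ and $\frac23$, as conditions (i) and (ii) require.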

Proof of Theorem 1.4. Let $f : Z \to \{0,1\}$ be a Boolean function with $Z \subseteq \{0,1\}^N \times \{0,1\}^M$. Fix
Alice's input $x \in \{0,1\}^N$, and let $Z_x$ be the set of all $y \in \{0,1\}^M$ such that $(x,y) \in Z$. By Yao's minimax
principle, to give a randomized protocol that errs with probability at most $\frac13$ for all $y \in Z_x$, it is enough,
for any fixed probability distribution $\mathcal{D}$ over $Z_x$, to give a randomized protocol that errs with probability at
most $\frac13$ over $y$ drawn from $\mathcal{D}$. [21]

So let $\mathcal{D}$ be such a distribution; then the randomized protocol is as follows. First Alice chooses $k$ inputs
$y_1,\ldots,y_k$ independently from $\mathcal{D}$, where $k = O(Q^1(f))$. She then sends Bob $y_1,\ldots,y_k$, together with
$f(x,y_i)$ for all $i \in \{1,\ldots,k\}$. Clearly this message requires only $O(M\,Q^1(f))$ classical bits. We need to
show that it lets Bob evaluate $f(x,y)$, with high probability over $y$ drawn from $\mathcal{D}$.

By amplification, we can assume Bob errs with probability at most $\eta$ for any fixed constant $\eta > 0$. We
will take $\eta = \frac{1}{100}$. Also, in the quantum protocol for $f$, let $\rho_x$ be the $Q^1(f)$-qubit mixed state that Alice
would send given input $x$, and let $E_y$ be the measurement that Bob would apply given input $y$. Then
$\operatorname{Tr}(E_y\rho_x) \ge 1-\eta$ if $f(x,y) = 1$, while $\operatorname{Tr}(E_y\rho_x) \le \eta$ if $f(x,y) = 0$.

Given Alice's classical message, first Bob finds a $Q^1(f)$-qubit state $\sigma$ such that $|\operatorname{Tr}(E_{y_i}\sigma) - f(x,y_i)| \le \eta$
for all $i \in \{1,\ldots,k\}$. Certainly such a state exists (for take $\sigma = \rho_x$), and Bob can find it by searching
exhaustively for its classical description. If there are multiple such states, then Bob chooses one
in some arbitrary deterministic way (for example, by lexicographic ordering). Note that we then have
$|\operatorname{Tr}(E_{y_i}\sigma) - \operatorname{Tr}(E_{y_i}\rho_x)| \le \eta$ for all $i \in \{1,\ldots,k\}$ as well. Finally Bob outputs $f(x,y) = 1$ if $\operatorname{Tr}(E_y\sigma) \ge \frac12$,
or $f(x,y) = 0$ if $\operatorname{Tr}(E_y\sigma) < \frac12$.

Set $\varepsilon = \delta = \frac16$ and $\gamma = 0.42$, so that $\gamma\varepsilon = 7\eta$. Then by Theorem 1.1,

\[ \Pr_{y\in\mathcal{D}}\left[|\operatorname{Tr}(E_y\sigma) - \operatorname{Tr}(E_y\rho_x)| > \gamma\right] > \varepsilon \]

with probability at most $\delta$ over Alice's classical message, provided that

\[ k \ge \frac{K}{\gamma^2\varepsilon^2}\left(\frac{Q^1(f)}{\gamma^2\varepsilon^2}\log^2\frac{1}{\gamma\varepsilon} + \log\frac{1}{\delta}\right). \]

So in particular, there exist constants $A, B$ such that if $k \ge A\,Q^1(f) + B$, then

\[ \Pr_{y\in\mathcal{D}}\left[|\operatorname{Tr}(E_y\sigma) - f(x,y)| > \gamma + \eta\right] > \varepsilon \]

with probability at most $\delta$. Since $\gamma + \eta < \frac12$, it follows that Bob's classical strategy will fail with probability
at most $\varepsilon + \delta = \frac13$ over $y$ drawn from $\mathcal{D}$. ■

Proof of Theorem 7.4.

(i) Similar to the proof that $\mathsf{BPP} \subset \mathsf{P/poly}$. Given a $\mathsf{ZPP}$ machine $M$, first amplify $M$ so that its failure
probability on any input of length $n$ is at most $2^{-2n}$. Then by a counting argument, there exists a
single random string $r_n$ that causes $M$ to succeed on all $2^n$ inputs simultaneously. Use that $r_n$ as the
$\mathsf{YP}$ machine's advice.

(ii) $\mathsf{YE} \subseteq \mathsf{NE} \cap \mathsf{coNE}$ is immediate. For $\mathsf{NE} \cap \mathsf{coNE} \subseteq \mathsf{YE}$, first concatenate the $\mathsf{NE}$ and $\mathsf{coNE}$ witnesses for
all $2^n$ inputs of length $n$, then use the resulting string (of length $2^{O(n)}$) as the $\mathsf{YE}$ machine's advice.

(iii) If $\mathsf{P} = \mathsf{YP}$ then $\mathsf{E} = \mathsf{YE}$ by padding. Hence $\mathsf{E} = \mathsf{NE} \cap \mathsf{coNE}$ by part (ii).

20 If we care about optimizing the constant under the $\Omega(k)$, then we are better off avoiding amplification and instead proving
Theorem 2.6 directly using the techniques of Ambainis et al. [7]. Doing so, we obtain $n/\gamma^2 \ge 2k/\ln 2$.

21 Indeed, it suffices to give a deterministic protocol that errs with probability at most $\frac13$ over $y$ drawn from $\mathcal{D}$, a fact we will
not need.
(iv) Let $M$ be a $\mathsf{YP}$ machine, and let $y_n$ be the lexicographically first advice string that causes $M$ to
succeed on all $2^n$ inputs of length $n$. Consider the following computational problem: given integers
$\langle n, i\rangle$ encoded in binary, compute the $i$th bit of $y_n$. We claim that this problem is in $\mathsf{NE}^{\mathsf{NP}^{\mathsf{NP}}}$. For an
$\mathsf{NE}^{\mathsf{NP}^{\mathsf{NP}}}$ machine can first guess $y_n$, then check that it works for all $x \in \{0,1\}^n$ using $\mathsf{NP}$ queries, then
check that no lexicographically earlier string also works using $\mathsf{NP}^{\mathsf{NP}}$ queries, and finally return the $i$th
bit of $y_n$. So if $\mathsf{E} = \mathsf{NE}^{\mathsf{NP}^{\mathsf{NP}}}$, then the problem is in $\mathsf{E}$, which means that an $\mathsf{E}$ machine can recover $y_n$
itself by simply looping over all $i$. So if $n$ and $i$ take only logarithmically many bits to specify, then a
$\mathsf{P}$ machine can recover $y_n$. Hence $\mathsf{P} = \mathsf{YP}$.

■

Proof of Lemma 7.10. The procedure $Q$ is given by the following pseudocode:

    Let $\rho^* := \rho$
    Choose $t \in \{1,\ldots,T\}$ uniformly at random
    For $u := 1$ to $t$
        Choose $i \in \{1,\ldots,m\}$ uniformly at random
        Apply $E_i$ to $\rho^*$
        If $E_i$ rejects, return "FAILURE" and halt
    Next $u$
    Return "SUCCESS" and output $\sigma := \rho^*$
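The control flow of this procedure can be simulated directly in a restricted setting. The sketch below assumes a classical special case that is not the general quantum one: the state is a diagonal density matrix (a probability vector over basis states) and each measurement is a projector onto a subset of basis states, so applying an accepting measurement simply conditions the distribution. The function name `run_q` and its arguments are hypothetical.

```python
import random


def run_q(state, accept_sets, T, rng):
    """One run of the procedure in the commuting (classical) special case.

    state:       list of probabilities over basis states.
    accept_sets: accept_sets[i] is the set of basis states on which
                 measurement i accepts.
    Returns the post-measurement distribution, or None on FAILURE.
    """
    current = list(state)
    t = rng.randint(1, T)                  # t uniform in {1, ..., T}
    for _ in range(t):
        i = rng.randrange(len(accept_sets))
        accepted = accept_sets[i]
        p_accept = sum(current[j] for j in accepted)
        if rng.random() >= p_accept:       # the measurement rejects
            return None
        # The measurement accepted: condition the state on the accept set.
        current = [
            current[j] / p_accept if j in accepted else 0.0
            for j in range(len(current))
        ]
    return current


rng = random.Random(0)
print(run_q([1.0, 0.0, 0.0], [{0, 1}, {0, 2}], T=20, rng=rng))
print(run_q([0.0, 0.0, 1.0], [{0, 1}, {0}], T=20, rng=rng))
```

In the first call every measurement accepts with certainty, so the state is returned unchanged; in the second every measurement rejects with certainty, so the run reports failure.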

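Property (ii) follows immediately from Proposition 7.9. For property (iii), let $\rho_u$ be the state of $\rho^*$
immediately after the $u$th iteration, conditioned on iterations $1,\ldots,u$ all succeeding. Also, let $\beta_u :=
\max_i\{1 - \operatorname{Tr}(E_i\rho_u)\}$. Then $Q$ fails in the $(u+1)$st iteration with probability at least $\beta_u/m$, conditioned on
succeeding in iterations $1,\ldots,u$. So letting $p_t$ be the probability that $Q$ completes all $t$ iterations, we have

\[ p_t \le \left(1 - \frac{\beta_0}{m}\right)\cdots\left(1 - \frac{\beta_{t-1}}{m}\right). \]

Hence, letting $z > 0$ be a parameter to be determined later,

\[ \sum_{t\,:\,\beta_t > z} p_t \;\le\; \sum_{t\,:\,\beta_t > z}\left(1 - \frac{\beta_0}{m}\right)\cdots\left(1 - \frac{\beta_{t-1}}{m}\right) \;\le\; \sum_{t\,:\,\beta_t > z}\ \prod_{u<t\,:\,\beta_u > z}\left(1 - \frac{z}{m}\right) \;\le\; \sum_{t=0}^{\infty}\left(1 - \frac{z}{m}\right)^t = \frac{m}{z}. \]

Also, by the assumption that $Q$ succeeds with probability at least $\lambda$, we have $\frac{1}{T}\sum_t p_t \ge \lambda$. So for all $i$,

\[ 1 - \operatorname{Tr}(E_i\sigma) = \frac{\sum_t p_t\left(1 - \operatorname{Tr}(E_i\rho_t)\right)}{\sum_t p_t} = \frac{\sum_{t\,:\,\beta_t \le z} p_t\left(1 - \operatorname{Tr}(E_i\rho_t)\right)}{\sum_t p_t} + \frac{\sum_{t\,:\,\beta_t > z} p_t\left(1 - \operatorname{Tr}(E_i\rho_t)\right)}{\sum_t p_t} \le \frac{\sum_{t\,:\,\beta_t \le z} p_t\beta_t}{\sum_t p_t} + \frac{m/z}{\sum_t p_t} \le z + \frac{m/z}{\lambda T}. \]

The last step is to set $z := \sqrt{\frac{m}{\lambda T}}$, thereby obtaining the optimal lower bound

\[ \operatorname{Tr}(E_i\sigma) \ge 1 - 2\sqrt{\frac{m}{\lambda T}}. \]

■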
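The choice of $z$ in the last step balances the two terms of the bound $z + \frac{m/z}{\lambda T}$. The following small sketch evaluates that bound and the balancing choice; the function names and sample parameter values are assumptions made for illustration.

```python
from math import sqrt


def failure_bound(z, m, lam, T):
    """Upper bound z + (m/z)/(lam*T) on 1 - Tr(E_i sigma) from the proof."""
    return z + (m / z) / (lam * T)


def optimal_z(m, lam, T):
    """The balancing choice z = sqrt(m / (lam*T))."""
    return sqrt(m / (lam * T))


m, lam, T = 10, 0.5, 2000          # illustrative values only
z_star = optimal_z(m, lam, T)
print(z_star, failure_bound(z_star, m, lam, T))
```

At the balancing point the bound equals $2\sqrt{m/(\lambda T)}$, and moving $z$ in either direction makes it larger.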

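Proof of Theorem 1.5. Fix a distributional problem $(L, \{\mathcal{D}_n\}) \in \mathsf{HeurBQP/qpoly}$. Then there exists
a polynomial-time quantum algorithm $A$ such that for all $n$ and $\varepsilon > 0$, there exists a state $|\psi_{n,\varepsilon}\rangle$ of size
$q = O(\operatorname{poly}(n, 1/\varepsilon))$ such that

\[ \Pr_{x\in\mathcal{D}_n}\left[P_A^{L(x)}\left(|x\rangle|\psi_{n,\varepsilon}\rangle\right) \ge \frac23\right] \ge 1 - \varepsilon. \]

Let $\mathcal{D}_n^*$ be the distribution obtained by starting from $\mathcal{D}_n$ and then conditioning on $P_A^{L(x)}(|x\rangle|\psi_{n,\varepsilon}\rangle) \ge \frac23$.
Then our goal will be to construct a polynomial-time verification procedure $V$ such that, for all $n$ and $\varepsilon > 0$,
there exists an advice string $a_{n,\varepsilon} \in \{0,1\}^{\operatorname{poly}(n,1/\varepsilon)}$ for which the following holds.

• There exists a state $|\varphi_{n,\varepsilon}\rangle \in \mathcal{H}_2^{\otimes\operatorname{poly}(n,1/\varepsilon)}$ such that

\[ \Pr_{x\in\mathcal{D}_n^*}\left[P_V^{L(x)}\left(|x\rangle|\varphi_{n,\varepsilon}\rangle|a_{n,\varepsilon}\rangle\right) \ge \frac23\right] \ge 1 - \varepsilon. \]

• The probability over $x \in \mathcal{D}_n^*$ that there exists a state $|\varphi\rangle$ such that $P_V^{1-L(x)}(|x\rangle|\varphi\rangle|a_{n,\varepsilon}\rangle) > \frac13$ is at
most $\varepsilon$.

If $V$ succeeds with probability at least $1-\varepsilon$ over $x \in \mathcal{D}_n^*$, then by the union bound it succeeds with
probability at least $1-2\varepsilon$ over $x \in \mathcal{D}_n$. Clearly this suffices to prove the theorem.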

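As a preliminary step, let us replace $A$ by an amplified algorithm $A^*$, which takes $|\psi_{n,\varepsilon}\rangle^{\otimes\ell}$ as advice and
returns the majority answer among $\ell$ invocations of $A$. Here $\ell$ is a parameter to be determined later. By
a Chernoff bound,

\[ \Pr_{x\in\mathcal{D}_n}\left[P_{A^*}^{L(x)}\left(|x\rangle|\psi_{n,\varepsilon}\rangle^{\otimes\ell}\right) \ge 1 - e^{-\ell/18}\right] \ge 1 - \varepsilon. \]
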
We now describe the verifier $V$. The verifier receives three objects as input:

• An input $x \in \{0,1\}^n$.

• An untrusted quantum advice state $|\varphi_0\rangle$. This $|\varphi_0\rangle$ is divided into $\ell$ registers, each with $q$ qubits.
The state that the verifier expects to receive is $|\varphi_0\rangle = |\psi_{n,\varepsilon}\rangle^{\otimes\ell}$.

• A trusted classical advice string $a_{n,\varepsilon}$. This $a_{n,\varepsilon}$ consists of $m$ test inputs $x_1,\ldots,x_m \in \{0,1\}^n$,
together with $L(x_i)$ for $i \in \{1,\ldots,m\}$. Here $m$ is a parameter to be determined later.

Given these objects, $V$ does the following, where $T$ is another parameter to be determined later.

    Phase 1: Verify $|\varphi_0\rangle$
    Let $|\varphi\rangle := |\varphi_0\rangle$
    Choose $t \in \{1,\ldots,T\}$ uniformly at random
    For $u := 1$ to $t$
        Choose $i \in \{1,\ldots,m\}$ uniformly at random
        Simulate $A^*(|x_i\rangle|\varphi\rangle)$
        If $A^*$ outputs $1 - L(x_i)$, output "don't know" and halt
    Next $u$

    Phase 2: Decide whether $x \in L$
    Simulate $A^*(|x\rangle|\varphi\rangle)$
    Accept if $A^*$ outputs 1; reject otherwise

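It suffices to show that there exists a choice of test inputs $x_1,\ldots,x_m$, as well as parameters $\ell$, $m$, and $T$,
for which the following holds.

(a) If $|\varphi_0\rangle = |\psi_{n,\varepsilon}\rangle^{\otimes\ell}$, then Phase 1 succeeds with probability at least $\frac56$.

(b) If Phase 1 succeeds with probability at least $\frac13$, then conditioned on its succeeding, $P_{A^*}^{L(x_i)}(|x_i\rangle|\varphi\rangle) \ge \frac{17}{18}$
for all $i \in \{1,\ldots,m\}$.

(c) If $P_{A^*}^{L(x_i)}(|x_i\rangle|\varphi\rangle) \ge \frac{17}{18}$ for all $i \in \{1,\ldots,m\}$, then

\[ \Pr_{x\in\mathcal{D}_n^*}\left[P_{A^*}^{L(x)}\left(|x\rangle|\varphi\rangle\right) \ge \frac56\right] \ge 1 - \varepsilon. \]
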

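To see why this suffices, note that conditions (a)-(c) ensure that the following holds with probability at least $1-\varepsilon$ over $x \in \mathcal{D}_n^*$. First,
if $|\varphi_0\rangle = |\psi_{n,\varepsilon}\rangle^{\otimes\ell}$, then

\[ P_V^{L(x)}\left(|x\rangle|\varphi_0\rangle|a_{n,\varepsilon}\rangle\right) \ge 1 - \frac16 - \frac16 = \frac23 \]

by the union bound. Here $\frac16$ is the maximum probability of failure in Phase 1, while $\frac56$ is the minimum
probability of success in Phase 2. Second, for all $|\varphi_0\rangle$, either Phase 1 succeeds with probability less than $\frac13$,
or else Phase 2 succeeds with probability at least $\frac56$. Hence

\[ P_V^{1-L(x)}\left(|x\rangle|\varphi_0\rangle|a_{n,\varepsilon}\rangle\right) \le \max\left\{\frac13, \frac16\right\} = \frac13. \]

Therefore $V$ is a valid $\mathsf{HeurYQP/poly}$ verifier as desired.
Set

\[ m := K\,\frac{q}{\varepsilon}\log^3\frac{q}{\varepsilon}, \qquad \ell := 100 + 9\ln m, \qquad T := 3888m, \]

where $K > 0$ is a sufficiently large constant and $q$ is the number of qubits of $|\psi_{n,\varepsilon}\rangle$. Also, form the advice
string $a_{n,\varepsilon}$ by choosing $x_1,\ldots,x_m$ independently from $\mathcal{D}_n^*$. We will show that conditions (a)-(c) all hold with
high probability over the choice of $x_1,\ldots,x_m$, and hence that there certainly exists a choice of $x_1,\ldots,x_m$
for which they hold.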



To prove (a), we appeal to part (ii) of Lemma 7.10. Setting $\epsilon := e^{-\ell/18}$, we have $P_{A^*}^{L(x_i)}(|x_i\rangle|\psi_{n,\varepsilon}\rangle^{\otimes\ell}) \ge
1 - \epsilon$ for all $i \in \{1,\ldots,m\}$. Therefore Phase 1 succeeds with probability at least

\[ 1 - T\sqrt{\epsilon} = 1 - 3888m \cdot e^{-\ell/9} \ge \frac56. \]


To prove (b), we appeal to part (iii) of Lemma 7.10. Set $\lambda := \frac13$. Then if Phase 1 succeeds with
probability at least $\lambda$, for all $i$ we have

\[ P_{A^*}^{L(x_i)}\left(|x_i\rangle|\varphi\rangle\right) \ge 1 - 2\sqrt{\frac{m}{\lambda T}} = 1 - 2\sqrt{\frac{3m}{3888m}} = \frac{17}{18}. \]
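The numerical constants used for (a) and (b) can be sanity-checked with a few lines of Python. This is only an arithmetic sketch: it evaluates the quantity $3888m \cdot e^{-\ell/9}$ as it appears in the argument for (a), with $\ell = 100 + 9\ln m$ and $T = 3888m$, and the bound $1 - 2\sqrt{m/(\lambda T)}$ from (b). The function names and the sample value of $m$ are assumptions.

```python
from math import exp, log, sqrt


def verifier_parameters(m):
    """Return (ell, T) for the parameter settings ell = 100 + 9 ln m, T = 3888 m."""
    ell = 100 + 9 * log(m)
    T = 3888 * m
    return ell, T


def phase1_failure_term(m):
    """The quantity 3888 * m * exp(-ell / 9) from the argument for (a)."""
    ell, T = verifier_parameters(m)
    return T * exp(-ell / 9)


def accuracy_after_phase1(m, lam=1 / 3):
    """The bound 1 - 2*sqrt(m / (lam * T)) from the argument for (b)."""
    _, T = verifier_parameters(m)
    return 1 - 2 * sqrt(m / (lam * T))


m = 1000  # illustrative value only
print(phase1_failure_term(m), accuracy_after_phase1(m))
```

Because $m \cdot e^{-(9\ln m)/9} = 1$, the first quantity equals $3888\,e^{-100/9}$ for every $m$, which is below $\frac16$; the second equals $\frac{17}{18}$ regardless of $m$.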

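Finally, to prove (c), we appeal to Theorem 1.2. Set $\eta := \frac{1}{18}$. Then for all $i$ we have

\[ P_{A^*}^{L(x_i)}\left(|x_i\rangle|\varphi\rangle\right) \ge \frac{17}{18} = 1 - \eta, \]

and also

\[ P_{A^*}^{L(x_i)}\left(|x_i\rangle|\psi_{n,\varepsilon}\rangle^{\otimes\ell}\right) \ge 1 - e^{-\ell/18} > 1 - \eta. \]

Hence

\[ \left|P_{A^*}^{L(x_i)}\left(|x_i\rangle|\varphi\rangle\right) - P_{A^*}^{L(x_i)}\left(|x_i\rangle|\psi_{n,\varepsilon}\rangle^{\otimes\ell}\right)\right| < \eta. \]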



Now set 7 := i and 8 := i. Then 7 > ?y and 



e 
3* 



log 2 



(7-f?) 2 (7-??)' 



log- 



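So Theorem 1.2 implies that

\[ \Pr_{x\in\mathcal{D}_n^*}\left[\left|P_{A^*}^{L(x)}\left(|x\rangle|\varphi\rangle\right) - P_{A^*}^{L(x)}\left(|x\rangle|\psi_{n,\varepsilon}\rangle^{\otimes\ell}\right)\right| > \gamma\right] \le \varepsilon \]

and hence

\[ \Pr_{x\in\mathcal{D}_n^*}\left[P_{A^*}^{L(x)}\left(|x\rangle|\varphi\rangle\right) < \frac56\right] \le \varepsilon \]

with probability at least $1-\delta$ over the choice of $a_{n,\varepsilon}$. Here we have used the facts that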



P 



1 1 1 1 



and that + 7 — i s >, — ,,■ - 
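Proof of Theorem 8.1. Let $\mathcal{D}^*$ be the distribution over $(x,b) \in S \times \{0,1\}$ obtained by first drawing $x$
from $\mathcal{D}$, and then setting $b = 1$ with probability $F(x)$ and $b = 0$ with probability $1 - F(x)$. Then we can
imagine that each $(x_i, b_i)$ was drawn from $\mathcal{D}^*$. Also, given a hypothesis $H \in \mathcal{C}$, let $H^* : S \times \{0,1\} \to [0,1]$
be the quadratic loss function defined by $H^*(x,0) = H(x)^2$ and $H^*(x,1) = (1 - H(x))^2$. Then let $\mathcal{C}^*$ be
the p-concept class consisting of $H^*$ for all $H \in \mathcal{C}$.
Call $H^* \in \mathcal{C}^*$ an "$\alpha$-good" function if

\[ \operatorname{EX}_{(x,b)\in\mathcal{D}^*}\left[H^*(x,b)\right] \le \alpha + \inf_{C^*\in\mathcal{C}^*}\operatorname{EX}_{(x,b)\in\mathcal{D}^*}\left[C^*(x,b)\right]. \]

Notice that

\[ \operatorname{EX}_{(x,b)\in\mathcal{D}^*}\left[H^*(x,b)\right] = \operatorname{EX}_{x\in\mathcal{D}}\left[(1 - F(x))H(x)^2 + F(x)(1 - H(x))^2\right] = \operatorname{EX}_{x\in\mathcal{D}}\left[(H(x) - F(x))^2 + F(x) - F(x)^2\right]. \]

Therefore, if $H^*$ is $\alpha$-good then

\[ \operatorname{EX}_{x\in\mathcal{D}}\left[(H(x) - F(x))^2\right] \le \alpha + \inf_{C\in\mathcal{C}}\operatorname{EX}_{x\in\mathcal{D}}\left[(C(x) - F(x))^2\right]. \]

Since

\[ \inf_{C\in\mathcal{C}}\operatorname{EX}_{x\in\mathcal{D}}\left[(C(x) - F(x))^2\right] = 0, \]

this implies in particular that

\[ \operatorname{EX}_{x\in\mathcal{D}}\left[(H(x) - F(x))^2\right] \le \alpha. \]

If we set $\alpha := \gamma^2\varepsilon$, then the above implies by Markov's inequality that

\[ \Pr_{x\in\mathcal{D}}\left[|H(x) - F(x)| > \gamma\right] \le \varepsilon \]

as desired.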
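The algebraic identity behind the step from the quadratic loss to $(H - F)^2$, namely $(1-F)H^2 + F(1-H)^2 = (H-F)^2 + F - F^2$, can be confirmed with exact arithmetic. The short sketch below checks it on a grid of rational values; the function names and the grid are illustrative assumptions.

```python
from fractions import Fraction


def expected_quadratic_loss(h, f):
    """EX over b of H*(x, b), when b = 1 with probability f and H(x) = h."""
    return (1 - f) * h ** 2 + f * (1 - h) ** 2


def decomposed_loss(h, f):
    """The same quantity rewritten as (h - f)^2 + f - f^2."""
    return (h - f) ** 2 + f - f ** 2


grid = [Fraction(j, 10) for j in range(11)]
identity_holds = all(
    expected_quadratic_loss(h, f) == decomposed_loss(h, f)
    for h in grid
    for f in grid
)
print(identity_holds)
```

Since the term $F - F^2$ does not depend on the hypothesis, minimizing the expected quadratic loss is the same as minimizing $\operatorname{EX}[(H(x) - F(x))^2]$.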

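Now suppose we are given $m$ samples $(x_1,b_1),\ldots,(x_m,b_m)$ drawn independently from $\mathcal{D}^*$. Also, let
$Z(x,b) = 0$ be the identically-zero function. Then Theorem 2.3 implies that, if we choose $H^* \in \mathcal{C}^*$ to
minimize

\[ \sum_{i=1}^m \left|H^*(x_i,b_i) - Z(x_i,b_i)\right| = \sum_{i=1}^m H^*(x_i,b_i) = \sum_{i=1}^m \left(H(x_i) - b_i\right)^2, \]

then $H^*$ will be $\alpha$-good with probability at least $1-\delta$, provided that

\[ m = O\!\left(\frac{1}{\alpha^2}\left(\operatorname{fat}_{\mathcal{C}^*}\!\left(\frac{\alpha}{5}\right)\log^2\frac{1}{\alpha} + \log\frac{1}{\delta}\right)\right). \]

Finally, we claim that $\operatorname{fat}_{\mathcal{C}^*}(\eta) \le 2\operatorname{fat}_{\mathcal{C}}(\eta/2)$ for all $\eta > 0$. To see this, let $\mathcal{C}_0$ be the p-concept class
consisting of $H(x)^2$ for all $H \in \mathcal{C}$, and let $\mathcal{C}_1$ be the class consisting of $(1 - H(x))^2$ for all $H \in \mathcal{C}$. Then
clearly

\[ \operatorname{fat}_{\mathcal{C}^*}(\eta) \le \operatorname{fat}_{\mathcal{C}_0}(\eta) + \operatorname{fat}_{\mathcal{C}_1}(\eta). \]

Also, if $H_1(x)^2 - H_2(x)^2 \ge \eta$, then $|H_1(x) - H_2(x)| \ge \eta/2$. Hence $\operatorname{fat}_{\mathcal{C}_0}(\eta) \le \operatorname{fat}_{\mathcal{C}}(\eta/2)$, and similarly
$\operatorname{fat}_{\mathcal{C}_1}(\eta) \le \operatorname{fat}_{\mathcal{C}}(\eta/2)$. ■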

Proof of Theorem 9.3. We start with part (i). Let $k = \operatorname{fine}_{\mathcal{C}}(\gamma,\eta)$, let $S = \{s_1,\ldots,s_k\}$ be any set
that is $(\gamma,\eta)$-fine-shattered by $\mathcal{C}$, and let $\mathcal{C}^* \subseteq \mathcal{C}$ be a subclass of size $2^k$ that $(\gamma,\eta)$-fine-shatters $S$. Then
the function $F$ will be chosen uniformly at random from $\mathcal{C}^*$. Also, the distribution $\mathcal{D}$ will choose $s_1$ with
probability $1 - 4\varepsilon$, and otherwise will choose uniformly at random from $\{s_2,\ldots,s_k\}$. Finally, the learning
algorithm will be given $p_i = \frac12 - \gamma$ if $F(x_i) \le \frac12 - \gamma$, or $p_i = \frac12 + \gamma$ if $F(x_i) \ge \frac12 + \gamma$.

First suppose m < 4; In Then the m samples si, . . . , s m will all equal s± with probability at least 

(1 -4e)* ln ^ > 2(5. 

Conditioned on this happening, the algorithm certainly fails with probability at least ^. 

Next suppose m < -|^-. Let r be the number of i's such that Xi ^ s\. Then EX [r] = 4em, and hence 

k — ll 4em 1 

4 J ~ (k - 1) /4 ~ 4 

by Markov's inequality. Furthermore, conditioned on r < there are at least | (k — 1) indices « for which 
the algorithm has "no information" about F (aii) — in other words, cannot predict whether F (xi) < h — 7 or 
F (xi) > \ + 7 better than the outcome of a fair coin toss. Yet to output an H such that 

Pr [\H(x)-F(x)\>^]<e<j, 

the algorithm needs to guess correctly for at least of these i's. Again by Markov's inequality, it can do 
this with probability at most iy| = |. Hence it fails with overall probability at least | • ^ = j > 5. 

Part (ii) can be proved along the same lines as part (i); we merely give a sketch. If to < i In then 
the algorithm fails with probability at least S for the same reason as before. Also, as before, the algorithm 
must be able to guess F (sj) with probability at least 1 — 8 for Q (k — 1) indices j S {2, . . . , k}. But this 
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requires m = ( j^^^ j samples xi, . . . , x m . For we can think of each F (sj) as a coin, whose bias (in the 
best case) is either | — (7 + 77) or = + (7 + 77). And it is known that, to decide whether a given coin has 
bias 5 — (7 + rj) or ^ + (7 + 77) with error probability at most (5 -C j, one needs to flip the coin Q ( ^^ n yi ^ 
times. ■ 

Proof of Theorem 9.5. Follows by small tweaks to the proof of Theorem 9.4 found in [7]. Without going
into too much detail, in Ambainis et al.'s construction we first choose a random set $T = \{(\pi_1,r_1),\ldots,(\pi_\ell,r_\ell)\}$
of transformations of a certain covering code, where $\ell = k^3$. We then use a Chernoff bound to argue that, with
high probability over the choice of $T$, each of the $k2^k$ probabilities $\operatorname{Tr}(E_i\rho_y)$ satisfies $|\operatorname{Tr}(E_i\rho_y) - q| \le 1/k$
for some universal constant $q \ge p + 1/k$. It follows that there exists a $T$ for which this property holds.

Now observe that without loss of generality we can make q = p + 1/k. To do so, we simply "dilute" each 
p y to cp y + (1 — c)I, where / is the maximally mixed state and c = p+ ^\fe 2 ■ This already proves the 
theorem in the special case $\eta = 2/k$. If $\eta < 2/k$, then it suffices to repeat Ambainis et al.'s argument with
$\ell = 4k/\eta^2$ instead of $\ell = k^3$. ■

Proof of Corollary 9.6. It is clear that any lower bound on $k$ that we can obtain from Theorem 9.5 is
also a lower bound on $\operatorname{fine}_{\mathcal{C}_n}(\gamma,\eta)$. Let $p := \gamma + \frac12$. Then basic properties of the entropy function imply
that $1 - H(p) \le 4\gamma^2$. Also, let $k := \lfloor n/5\gamma^2 \rfloor$ and $\eta := 8\gamma^2/n$. Then one can check that the two conditions
of Theorem 9.5 are satisfied, as follows. Firstly, $k \le \frac{n}{5\gamma^2} = \frac{8}{5\eta}$. Secondly,

\begin{align*}
n &\ge 5\gamma^2 k \\
  &\ge (1 - H(p))k + \gamma^2 k \\
  &\ge (1 - H(p))k + \frac{n}{5} - \gamma^2 \\
  &\ge (1 - H(p))k + \frac{n-5}{5} \\
  &\ge (1 - H(p))k + 7\log_2\frac{1}{\eta},
\end{align*}

where the last line uses the fact that $\eta \ge 2^{-(n-5)/35}$. Hence there exist $n$-qubit mixed states $\{\rho_y\}_{y\in\{0,1\}^k}$
and measurements $E_1,\ldots,E_k$ such that for all $y \in \{0,1\}^k$ and $i \in \{1,\ldots,k\}$:

(i) if $y_i = 0$ then $\frac12 - \gamma - \eta \le \operatorname{Tr}(E_i\rho_y) \le \frac12 - \gamma$, and

(ii) if $y_i = 1$ then $\frac12 + \gamma \le \operatorname{Tr}(E_i\rho_y) \le \frac12 + \gamma + \eta$.
■
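The parameter choices in this proof can be checked numerically for a concrete instance. The sketch below uses exact rationals for $k$ and $\eta$ and floating point for the entropy; the function names and the sample values $n = 1000$, $\gamma = \frac14$ are assumptions chosen for illustration.

```python
from fractions import Fraction
from math import floor, log2


def binary_entropy(p):
    """H(p) in bits, with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


def corollary_9_6_parameters(n, gamma):
    """k := floor(n / (5 gamma^2)) and eta := 8 gamma^2 / n, with gamma a Fraction."""
    k = floor(Fraction(n) / (5 * gamma ** 2))
    eta = 8 * gamma ** 2 / n
    return k, eta


def check_conditions(n, gamma):
    """Check the inequalities used in the proof for one choice of (n, gamma)."""
    k, eta = corollary_9_6_parameters(n, gamma)
    gap = 1 - binary_entropy(float(gamma) + 0.5)   # 1 - H(p) with p = gamma + 1/2
    return {
        "entropy_bound": gap <= 4 * float(gamma) ** 2,
        "k_bound": k <= Fraction(8) / (5 * eta),
        "n_bound": n >= gap * k + float(gamma ** 2) * k,
    }


print(corollary_9_6_parameters(1000, Fraction(1, 4)))
print(check_conditions(1000, Fraction(1, 4)))
```

For this instance $k = 3200$ and $\eta = \frac{1}{2000}$, and all three checked inequalities hold.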

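Proof of Theorem 10.1. As in the proof of Theorem 8.1, let $\mathcal{D}^*$ be the distribution over $(x,b) \in S \times \{0,1\}$
obtained by first drawing $x$ from $\mathcal{D}$, then setting $b = 1$ with probability $F(x)$. Also, given a hypothesis
$H \in \mathcal{C}$, let $H^*(x,0) = H(x)$ and $H^*(x,1) = 1 - H(x)$. Then let $\mathcal{C}^*$ be the p-concept class consisting of
$H^*$ for all $H \in \mathcal{C}$.

Call $H^* \in \mathcal{C}^*$ an $\alpha$-good function if

\[ \operatorname{EX}_{(x,b)\in\mathcal{D}^*}\left[H^*(x,b)\right] \le \alpha + \inf_{C^*\in\mathcal{C}^*}\operatorname{EX}_{(x,b)\in\mathcal{D}^*}\left[C^*(x,b)\right]. \]

Notice that if $H^*$ is $\alpha$-good, then

\[ \operatorname{EX}_{x\in\mathcal{D}}\left[\Delta_{H,F}(x)\right] \le \alpha + \inf_{C\in\mathcal{C}}\operatorname{EX}_{x\in\mathcal{D}}\left[\Delta_{C,F}(x)\right] \]

as desired.

Now, suppose we are given $m$ samples $(x_1,b_1),\ldots,(x_m,b_m)$ drawn independently from $\mathcal{D}^*$. Then
Theorem 2.3 implies that, if we choose $H^* \in \mathcal{C}^*$ to minimize

\[ \sum_{i=1}^m H^*(x_i,b_i) = \sum_{i=1}^m \left|H(x_i) - b_i\right|, \]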
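then $H^*$ will be $\alpha$-good with probability at least $1-\delta$, provided that

\[ m = O\!\left(\frac{1}{\alpha^2}\left(\operatorname{fat}_{\mathcal{C}^*}\!\left(\frac{\alpha}{5}\right)\log^2\frac{1}{\alpha} + \log\frac{1}{\delta}\right)\right). \]
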
Finally, 

MS) 

by the same argument as in Theorem 8.1. ■